Using Python Library
Here's a minimal example to run LLMBoost with the Python Library:
Before we begin, make sure you are authenticated with the HuggingFace CLI. Follow HF Authentication Guide for more details.
1. Quick Start​
In the first sample below, we are using get_output()
which gives per-prompt output granularity. In the second sample, we will use aget_output()
which will stream the output tokens as they arrive.
For a step-by-step explanation, please go straight to Tutorial.
Sample #1: Non-streaming Output​
from llmboost import LLMBoost
def main():
llm = LLMBoost(model_name="meta-llama/Llama-3.1-8B-Instruct")
llm.start()
# Prepare formatted input using apply_format
formatted_input = llm.apply_format([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the cutest cat?"}
])
formatted_input["id"] = 0
# Issue the input
llm.issues_inputs([formatted_input])
output_received = False
while not output_received:
outputs = llm.get_output()
for out in outputs:
print(out["val"], end="")
if out["finished"]:
print("\n")
output_received = True
llm.stop()
if __name__ == "__main__":
main()
Sample #2: Streaming Output Tokens​
This minimal example streams the model's output tokens as they're generated, using async I/O.
import asyncio
from llmboost import LLMBoost
async def main():
llm = LLMBoost(
model_name="meta-llama/Llama-3.1-8B-Instruct",
streaming=True,
enable_async_output=True
)
llm.start()
# Format input
formatted_input = llm.apply_format([
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain the benefits of using quantized models."}
])
formatted_input["id"] = 0
# Issue the input
llm.issues_inputs([formatted_input])
# Stream output
final_output = ""
while True:
output = await llm.aget_output()
if isinstance(output, list):
output = output[0]
print(output["val"], end="", flush=True)
final_output += output["val"]
if output.get("finished", False):
print()
break
llm.stop()
if __name__ == "__main__":
asyncio.run(main())
2. Running Sample Scripts​
There are 2 sample scripts which can be found within our docker at /workspace/apps/
:
benchmark.py
accuracy.py
benchmark.py:​
The benchmark
application is for measuring the raw performance of LLMBoost on fixed-length input and fixed-length output dummy data.
Example:
python apps/benchmark.py --model_name meta-llama/Llama-3.1-8B-Instruct --input_len 128 --output_len 128 --num_prompts 1000
accuracy.py:​
The accuracy
application is for measuring the performance and inference accuracy of LLMBoost on a popular open-source benchmarking dataset.
Example:
python apps/accuracy.py --model_name meta-llama/Llama-3.1-8B-Instruct --num_prompts 1000