Using the Python Library

Here are two minimal examples showing how to run LLMBoost through the Python library:

note

Before we begin, make sure you are authenticated with the HuggingFace CLI. Follow the HF Authentication Guide for more details. A minimal login sketch follows this note.
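
If you just need a quick login, here is a minimal sketch using the huggingface_hub package that ships with the HuggingFace CLI (the shell equivalent is huggingface-cli login):

from huggingface_hub import login

# Prompts interactively for a Hugging Face access token;
# alternatively, pass token="hf_..." directly.
# Note: gated models such as meta-llama/Llama-3.1-8B-Instruct also require
# an account that has accepted the model's license on huggingface.co.
login()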

1. Quick Start

The first sample uses get_output(), which returns output at per-prompt granularity. The second sample uses aget_output(), which streams output tokens as they arrive.

Want more details?

For a step-by-step explanation, see the Tutorial.

Sample #1: Non-streaming Output

from llmboost import LLMBoost

def main():
    llm = LLMBoost(model_name="meta-llama/Llama-3.1-8B-Instruct")
    llm.start()

    # Prepare formatted input using apply_format
    formatted_input = llm.apply_format([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the cutest cat?"}
    ])
    formatted_input["id"] = 0

    # Issue the input
    llm.issues_inputs([formatted_input])

    # Drain outputs until the final token arrives
    output_received = False
    while not output_received:
        outputs = llm.get_output()
        for out in outputs:
            print(out["val"], end="")
            if out["finished"]:
                print("\n")
                output_received = True
    llm.stop()

if __name__ == "__main__":
    main()
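
Sample #1 issues a single prompt, but issues_inputs() takes a list, so it extends naturally to batches. The sketch below reuses only the calls shown above; it assumes each output dict carries the same "id" field that was set on its input, which the single-prompt samples do not demonstrate:

from llmboost import LLMBoost

def run_batch(prompts):
    llm = LLMBoost(model_name="meta-llama/Llama-3.1-8B-Instruct")
    llm.start()

    # Format every prompt and tag each with a distinct id
    inputs = []
    for i, prompt in enumerate(prompts):
        formatted = llm.apply_format([
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": prompt},
        ])
        formatted["id"] = i
        inputs.append(formatted)

    llm.issues_inputs(inputs)

    # Accumulate per-prompt text until every prompt reports finished
    results = {i: "" for i in range(len(prompts))}
    finished = set()
    while len(finished) < len(prompts):
        for out in llm.get_output():
            results[out["id"]] += out["val"]
            if out["finished"]:
                finished.add(out["id"])

    llm.stop()
    return results

if __name__ == "__main__":
    print(run_batch(["What is the cutest cat?", "Name three dog breeds."]))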

Sample #2: Streaming Output Tokens

This minimal example streams the model's output tokens as they're generated, using async I/O.

import asyncio
from llmboost import LLMBoost

async def main():
    llm = LLMBoost(
        model_name="meta-llama/Llama-3.1-8B-Instruct",
        streaming=True,
        enable_async_output=True
    )
    llm.start()

    # Format input
    formatted_input = llm.apply_format([
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain the benefits of using quantized models."}
    ])
    formatted_input["id"] = 0

    # Issue the input
    llm.issues_inputs([formatted_input])

    # Stream output tokens as they arrive
    final_output = ""
    while True:
        output = await llm.aget_output()
        if isinstance(output, list):
            output = output[0]

        print(output["val"], end="", flush=True)
        final_output += output["val"]

        if output.get("finished", False):
            print()
            break

    llm.stop()

if __name__ == "__main__":
    asyncio.run(main())

2. Running Sample Scripts

There are two sample scripts, which can be found inside our Docker image at /workspace/apps/:

  • benchmark.py
  • accuracy.py

benchmark.py:

The benchmark application measures the raw performance of LLMBoost on dummy data with fixed-length inputs and fixed-length outputs.

Example:

python apps/benchmark.py --model_name meta-llama/Llama-3.1-8B-Instruct --input_len 128 --output_len 128 --num_prompts 1000
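
To characterize performance across several input shapes, one option is to drive benchmark.py from a small Python script. This is just a sketch that reuses only the flags shown above; adjust the sweep values to your needs:

import subprocess

# Sweep a few fixed input lengths; output length and prompt count stay fixed.
for input_len in (128, 512, 1024):
    subprocess.run(
        [
            "python", "apps/benchmark.py",
            "--model_name", "meta-llama/Llama-3.1-8B-Instruct",
            "--input_len", str(input_len),
            "--output_len", "128",
            "--num_prompts", "1000",
        ],
        check=True,
    )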

accuracy.py:

The accuracy application measures both the performance and the inference accuracy of LLMBoost on a popular open-source benchmark dataset.

Example:

python apps/accuracy.py --model_name meta-llama/Llama-3.1-8B-Instruct --num_prompts 1000