Tutorial
This tutorial walks you through the LLMBoost In-Process SDK so you can easily integrate it into your own Python application.
If you're planning to integrate LLMBoost through an OpenAI-compatible API, check out the OpenAI API instead.
If you're using a cloud-based LLMBoost image, the code samples in this tutorial are included in the tutorial.py file.
📦 Importing Packages
Start by importing the necessary libraries:
# Copyright 2024, MangoBoost, Inc. All rights reserved.
import pickle
import time
from llmboost import LLMBoost
import evaluate
import nltk
from transformers import AutoTokenizer
from tqdm import tqdm
import numpy as np
from absl import app
The LLMBoost runtime is imported via
from llmboost import LLMBoost
The tutorial is an absl application that leverages the LLMBoost runtime to perform text completion.
The inputs and expected outputs are taken from the OpenOrca dataset.
After completion, we'll compute the ROUGE score for the generated outputs.
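This is why evaluate and nltk appear in the imports above: evaluate provides the ROUGE metric, and nltk is commonly used to split each answer into sentences before scoring. The sketch below shows that final step; output_texts and target_texts are placeholder names for the generated and reference answers, and this is not the tutorial's exact code:
# Sketch only: score generated answers against references with ROUGE.
# output_texts and target_texts are placeholder names for the model outputs
# and the reference answers.
nltk.download("punkt", quiet=True)

def to_sentences(texts):
    # rougeLsum expects sentences separated by newlines.
    return ["\n".join(nltk.sent_tokenize(t.strip())) for t in texts]

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=to_sentences(output_texts),
    references=to_sentences(target_texts),
)
print(scores)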
As is standard in absl applications, we start with the main and __main__ declarations:
def main(_):
    ...

if __name__ == "__main__":
    app.run(main)
All of the following code blocks are placed inside the main(_) function.
📄 Load the Dataset
The snippet below reads a pre-processed version of the OpenOrca dataset (saved in a pkl file) and extracts the system and user prompts as well as the expected results.
If you are interested in how the dataset was processed, please refer to this resource.
# The dataset is a pickled pandas DataFrame with the following columns:
# - system_prompt: A system-generated prompt.
# - question: A user-generated question.
# - output: The reference answer to the question.
dataset_path = "./llama2_data.pkl"
with open(dataset_path, "rb") as f:
    dataset = pickle.load(f)

source_texts = []
for _, row in dataset.iterrows():
    system = row.get("system_prompt", None)
    user = row.get("question", None)
    source_texts.append({"system": system, "user": user})
target_texts = dataset["output"].tolist()
print(f"Loaded dataset with {len(source_texts)} samples")
⚙️ Initialize LLMBoost
LLMBoost supports multiple parallelism strategies (see Parallelism for details). By default, it auto-configures based on the available hardware, but you can also set the tensor parallelism (tp) and data parallelism (dp) factors directly; an example with explicit values follows the table below.
LLMBoost Initialization Parameters:
Parameter | Type | Description |
---|---|---|
model_name | str | Name or path to the model |
query_type | str | Input type: "text", "tokens", "image", "video" |
tp | int | Tensor parallelism factor |
dp | int | Data parallelism factor |
max_num_seqs | int | Max concurrent sequences |
max_model_len | int | Max input sequence length |
max_num_batched_tokens | int | Max token batch size during prefill |
enforce_eager | bool | Force PyTorch eager mode |
load_format | str | "auto", "pt", "safetensor" |
quantization | str or None | "fp8" or None |
kv_cache_dtype | str | "auto" or "fp8" |
streaming | bool | Enable token-by-token streaming |
enable_async_output | bool | Enable async output queue |
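For reference, a fully explicit initialization might look like the sketch below; the values are illustrative only and should be tuned to your hardware and model:
# Illustrative only: pin the parallelism and limits explicitly instead of
# letting LLMBoost auto-configure. Here: 2-way tensor parallelism replicated
# across 4 data-parallel groups, with an FP8 KV cache.
llmboost = LLMBoost(
    model_name="meta-llama/Llama-3.2-1B-Instruct",
    query_type="text",
    tp=2,
    dp=4,
    max_num_seqs=256,
    max_model_len=4096,
    kv_cache_dtype="fp8",
)
llmboost.start()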
In this tutorial, however, we simply name our target model, in this case meta-llama/Llama-3.2-1B-Instruct, and pass tp and dp as dummy values of 0 to let LLMBoost automatically configure the parallelism strategy.
Finally, we call llmboost.start() to prepare the runtime to accept inference requests.
model_name = "meta-llama/Llama-3.2-1B-Instruct"
tp = 0 # Let LLMBoost decide
dp = 0
llmboost = LLMBoost(model_name=model_name, tp=tp, dp=dp)
llmboost.start()
LLMBoost supports both models hosted on the Hugging Face Hub and local model files in Hugging Face format.
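For example, to load weights from a local directory in Hugging Face format, point model_name at that path (the path below is illustrative):
# Illustrative: model_name may be a local directory containing
# Hugging Face-format weights instead of a Hub model ID.
llmboost = LLMBoost(model_name="/models/Llama-3.2-1B-Instruct", tp=0, dp=0)
llmboost.start()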