Tutorial
This tutorial demonstrates how to use the In-Process SDK of LLMBoost so that you can integrate it into your own application.
If you plan on using LLMBoost only as an inference server, feel free to skip straight to Inference Server.
If you are using one of the Cloud-based LLMBoost images, the contents of the code listings can be found in tutorial.py.
First we import the relevant packages:
# Copyright 2024, MangoBoost, Inc. All rights reserved.
import pickle
import time
from llmboost import LLMBoost
import evaluate
import nltk
from transformers import AutoTokenizer
from tqdm import tqdm
import numpy as np
from absl import app
The LLMBoost runtime is imported using from llmboost import LLMBoost.
The tutorial is an absl application leveraging the LLMBoost runtime to perform text completion.
The inputs and expected outputs are taken from the Open ORCA dataset.
At the end of inference, the ROUGE score is calculated over the 1000 samples.
As is standard in absl applications, we start with the main and __main__ declarations:
def main(_):
    ...

if __name__ == "__main__":
    app.run(main)
All of the code from this point on is placed in the def main(_) function.
The snippet below reads a pre-processed version of the Open ORCA dataset (saved in a pkl file) and extracts the system and user prompts in addition to the expected results.
If you are interested in how the dataset was processed, please refer to this resource.
# The dataset is a pickled pandas DataFrame with the following columns:
# - system_prompt: A system-generated prompt.
# - question: A user-generated question.
# - output: The reference answer to the question.
dataset_path = "./llama2_data.pkl"
with open(dataset_path, "rb") as f:
    dataset = pickle.load(f)
source_texts = []
for _, row in dataset.iterrows():
    system = row.get("system_prompt", None)
    user = row.get("question", None)
    source_texts.append({"system": system, "user": user})
target_texts = dataset["output"].tolist()
print(f"Loaded dataset with {len(source_texts)} samples")
Now it is time to set up the LLMBoost runtime. LLMBoost is an optimized inference runtime that exposes multiple levels of parallelism. By default, LLMBoost tries to determine the best parallelism configuration to maximize throughput; however, Data Parallelism and Tensor Parallelism can also be controlled directly (an illustrative sketch follows the snippet below), and the API documentation describes how to do this. For additional information on Data Parallelism and Tensor Parallelism, please refer to the section on Parallelism.
In this snippet we name our target model, in this case microsoft/Phi-3-mini-4k-instruct, and specify the max_tokens we want to generate.
Finally, we call llmboost.start() to prepare the runtime to start accepting inference requests.
model_name = "microsoft/Phi-3-mini-4k-instruct"
# Instantiate LLMBoost with the specified parameters.
llmboost = LLMBoost(model_name=model_name, max_tokens=1024)
# Start the LLMBoost runtime.
llmboost.start()
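If you want to pin the parallelism explicitly instead of relying on the defaults, the following sketch illustrates the general idea. Note that the dp and tp keyword names below are hypothetical placeholders, not confirmed LLMBoost arguments; the API documentation lists the actual parameter names.
# Hypothetical sketch only: dp/tp are placeholder keyword names.
# Consult the API documentation for the real parallelism arguments.
llmboost = LLMBoost(
    model_name=model_name,
    max_tokens=1024,
    dp=2,  # placeholder: number of data-parallel replicas
    tp=4,  # placeholder: tensor-parallel degree per replica
)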
LLMBoost supports loading models in multiple formats, including downloading models directly from HuggingFace.
Now that the LLMBoost runtime has been started, we can format the inputs and issue them to the runtime.
In the case of the Open ORCA dataset there is a system and a user message, and the exact format of these prompts often depends on the particular model.
LLMBoost attempts to apply the model-specific format with llmboost.apply_format(...); alternatively, a user can build their own format and submit the requests as a list of requests (List[Dict[int, str]]). A sketch of this alternative is shown after the next snippet.
Since LLMBoost is designed for streaming LLM inference, there are separate thread-safe function calls for issuing requests and collecting results.
In a production environment we recommend using different threads for issuing and collecting; a minimal sketch of this pattern appears near the end of this tutorial.
The llmboost.get_output() function is a blocking call that will sleep until data is available.
For more information on the specifics, please refer to the API documentation.
num_prompts = 1000
# The apply_format function formats the input text segments into the required format.
inputs = []
for i, p in enumerate(source_texts[:num_prompts]):
    inputs.append(
        llmboost.apply_format(
            [
                {"role": "system", "content": p["system"]},
                {"role": "user", "content": p["user"]},
            ]
        )
    )
    # Assign an id to each input for tracking the output.
    inputs[-1]["id"] = i
# Issue the inputs to LLMBoost.
# The input format is a list of dictionaries with the following keys:
# - id: A unique identifier for the input.
# - val: The input text.
llmboost.issue_inputs(inputs)
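As an aside, if you prefer to build the prompt format yourself rather than use llmboost.apply_format, one option is the HuggingFace chat template (this is one use for the AutoTokenizer import at the top). The sketch below is an assumption about how such inputs could be constructed; some chat templates do not accept a separate system role, and the {"id": ..., "val": ...} layout simply mirrors the input format described in the comments above.
# Illustrative alternative (assumption): build the prompt with the HuggingFace
# chat template and submit {"id": ..., "val": ...} dictionaries directly.
tokenizer = AutoTokenizer.from_pretrained(model_name)
manual_inputs = []
for i, p in enumerate(source_texts[:num_prompts]):
    # Some chat templates reject a separate "system" role; adjust as needed.
    text = tokenizer.apply_chat_template(
        [
            {"role": "system", "content": p["system"]},
            {"role": "user", "content": p["user"]},
        ],
        tokenize=False,
        add_generation_prompt=True,
    )
    manual_inputs.append({"id": i, "val": text})
# These could then be issued exactly like the formatted inputs above:
# llmboost.issue_inputs(manual_inputs)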
This next snippet collects the results from LLMBoost as they become available. LLMBoost also supports streaming of incomplete results; this tutorial doesn't cover that aspect, so please refer to the API documentation if you are interested.
# Query LLMBoost for the outputs.
# The outputs are returned as a list of dictionaries with the following keys:
# - id: The unique identifier of the input.
# - val: The output text.
# - finished: A flag indicating if the output is complete. (Only relevant for streaming mode)
preds = {}
out_count = 0
while out_count < len(inputs):
    outputs = llmboost.get_output()
    for q in outputs:
        preds[q["id"]] = q["val"]
        out_count += 1
Now that all of the results have been collected, we can stop the runtime.
llmboost.stop()
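As recommended earlier, a production application would typically issue requests and collect results on separate threads. The sketch below is illustrative only; it reuses the issue_inputs and get_output calls shown in this tutorial, but the surrounding threading structure is an assumption rather than a prescribed pattern.
import threading

# Illustrative sketch: issue on one thread, collect on another.
def issue_worker(llmboost, inputs):
    # Issuing is thread-safe, so it can overlap with collection.
    llmboost.issue_inputs(inputs)

def collect_worker(llmboost, expected, results):
    collected = 0
    while collected < expected:
        # get_output() blocks until at least one result is available.
        for q in llmboost.get_output():
            results[q["id"]] = q["val"]
            collected += 1

results = {}
issuer = threading.Thread(target=issue_worker, args=(llmboost, inputs))
collector = threading.Thread(target=collect_worker, args=(llmboost, len(inputs), results))
collector.start()
issuer.start()
issuer.join()
collector.join()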
The tutorial.py provided with the LLMBoost release also has additional code to measure performance; this has been left out here so the tutorial stays focused on the parts needed to run LLMBoost.
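For a rough idea of what such a measurement could look like (a generic sketch, not the exact code shipped in tutorial.py), the time module imported at the top can be used to wrap the issue/collect phase and derive a requests-per-second figure:
# Generic timing sketch (not the code shipped in tutorial.py).
start_time = time.perf_counter()
llmboost.issue_inputs(inputs)
out_count = 0
while out_count < len(inputs):
    # Count completed requests as they arrive.
    out_count += len(llmboost.get_output())
elapsed = time.perf_counter() - start_time
print(f"Processed {len(inputs)} requests in {elapsed:.2f}s "
      f"({len(inputs) / elapsed:.2f} requests/s)")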
This final section compares the output of LLMBoost to the expected output and calculates the ROUGE score.
# Sort the outputs based on the id.
preds = [preds[key] for key in sorted(preds.keys(), reverse=False)]
# Calculate the ROUGE scores.
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
metric = evaluate.load("rouge")
# Do some postprocessing to the texts.
preds = [pred.strip() for pred in preds]
target_texts = [target.strip() for target in target_texts[0 : len(preds)]]
# rougeLsum expects a newline after each sentence.
preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
target_texts = ["\n".join(nltk.sent_tokenize(target)) for target in target_texts]
# Calculate the ROUGE scores.
result = metric.compute(
    predictions=preds,
    references=target_texts,
    use_stemmer=True,
    use_aggregator=False,
)
result = {k: round(np.mean(v) * 100, 4) for k, v in result.items()}
print(f"ROUGE1: {result['rouge1']}")
print(f"ROUGE2: {result['rouge2']}")
print(f"ROUGEL: {result['rougeL']}")
print(f"ROUGELsum: {result['rougeLsum']}")