Inference Quick Start

This guide walks you through setting up and running a quick LLM inference benchmark using LLMBoost on either AMD or NVIDIA GPUs.

Before running any inference workloads, you'll need to set up the environment where LLMBoost will operate. This involves pulling the correct Docker image for your GPU backend (AMD or NVIDIA) and preparing environment variables such as model paths and access tokens. LLMBoost is packaged in a containerized format to ensure consistency, portability, and optimized performance across different systems.

Step 1: Pull the LLMBoost Docker image

To get access to the LLMBoost Docker image, please contact the MangoBoost team.

docker pull mangollm/<llmboost-docker-image-name>:prod-cuda

Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.
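
Once the pull completes, you can confirm the image is available locally with a quick sanity check (substitute the image name MangoBoost provided):

docker images mangollm/<llmboost-docker-image-name>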

Step 2: Set up environment

Set the following environment variables on the host node.

export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>

💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.

  • Set MODEL_PATH to the absolute path of the directory on your host file system where your local models are stored.
  • Set LICENSE_FILE to the absolute path of your license file. Please contact us through [email protected] if you don't have an LLMBoost license.
  • Set HUGGING_FACE_HUB_TOKEN to a Hugging Face access token obtained from huggingface.co/settings/tokens.
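
As a reference, here is a minimal sketch of what these exports might look like, followed by a few optional sanity checks. The paths and token below are placeholders, not real values; replace them with your own.

# Hypothetical example values -- replace with your own paths and token
export MODEL_PATH=/data/models
export LICENSE_FILE=/home/user/llmboost_license.skm
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxx

# Optional sanity checks before launching the container
[ -d "$MODEL_PATH" ]   || echo "MODEL_PATH is not a directory: $MODEL_PATH"
[ -f "$LICENSE_FILE" ] || echo "LICENSE_FILE not found: $LICENSE_FILE"
[ -n "$HUGGING_FACE_HUB_TOKEN" ] || echo "HUGGING_FACE_HUB_TOKEN is empty"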

Step 3: Run a benchmark in LLMBoost container

Use the following command to launch the LLMBoost container and run an inference benchmark on an NVIDIA or AMD GPU. The benchmark measures the throughput and latency of the Meta-Llama/Llama-3.2-1B-Instruct model.

docker run -it --rm \
    --network host \
    --gpus all \
    --pid=host \
    --group-add video \
    --ipc host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $MODEL_PATH:/workspace/models \
    -v $LICENSE_FILE:/workspace/llmboost_license.skm \
    -w /workspace \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    mangollm/<llmboost-docker-image-name>:prod-cuda \
    bash -c "python3 apps/benchmark.py \
        --model_name Meta-Llama/Llama-3.2-1B-Instruct \
        --input_len 128 --output_len 128"
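
The command above uses the NVIDIA flags (--gpus all) together with the prod-cuda image. On AMD GPUs, containers typically reach the GPU through the ROCm device nodes instead of --gpus; the sketch below shows what an equivalent launch might look like, where <rocm-image-tag> is a placeholder for whatever ROCm image tag MangoBoost provides.

docker run -it --rm \
    --network host \
    --pid=host \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --ipc host \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    -v $MODEL_PATH:/workspace/models \
    -v $LICENSE_FILE:/workspace/llmboost_license.skm \
    -w /workspace \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    mangollm/<llmboost-docker-image-name>:<rocm-image-tag> \
    bash -c "python3 apps/benchmark.py \
        --model_name Meta-Llama/Llama-3.2-1B-Instruct \
        --input_len 128 --output_len 128"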

Expected Output

Once the benchmark completes, you should see performance metrics like this:

Initializing LLMBoost...
Preparing model with 8192 context length...
Applying auto parallelization...
INFO 05-30 22:49:02 [__init__.py:239] Automatically detected platform rocm.
config.json: 100%|█████████████████████████████████████████████| 877/877 [00:00<00:00, 12.0MB/s]
Deploying LLMBoost (this may take a few minutes) .............................................|
I0530 22:51:52.703913 140310129037312 benchmark.py:170] Starting LLMBoost with 1000 inputs
benchmark.py: 100%|████████████████████████████████████████████| 1000/1000 [00:07<00:00, 131.04req/s]
I0530 22:52:00.335985 140310129037312 benchmark.py:256] LLMBoost Finished
I0530 22:52:00.336340 140310129037312 benchmark.py:287] Total time: 1.0858112429850735 seconds
I0530 22:52:00.336368 140310129037312 benchmark.py:288] Throughput: 235768.42 tokens/s
I0530 22:52:00.336387 140310129037312 benchmark.py:289] 920.97 reqs/s
I0530 22:52:00.336404 140310129037312 benchmark.py:290] Prompt : 117884.21 tokens/s
I0530 22:52:00.336420 140310129037312 benchmark.py:291] Generation: 117884.21 tokens/s
I0530 22:52:00.336437 140310129037312 benchmark.py:292] TTFT : 1.0453 seconds
I0530 22:52:00.336452 140310129037312 benchmark.py:293] TBT : 0.0000 seconds

Metrics Explained:

  • Throughput: Total number of tokens processed per second
  • Requests/sec: Number of prompt requests completed per second
  • TTFT (Time-To-First-Token): Latency until the first token is generated
  • TBT (Time-Between-Tokens): Latency between token generations
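
As a quick cross-check, total token throughput should roughly equal requests per second times the tokens per request (here 128 input + 128 output tokens):

# 920.97 reqs/s x (128 + 128) tokens/req ~= 235,768 tokens/s, matching the reported throughput
python3 -c "print(920.97 * (128 + 128))"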

Next Steps

  • Compare the performance of different models by modifying the --model_name flag in the docker run command, as shown below.
  • Explore the available deployment and integration options in the How-To Guides.
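
For example, to benchmark a different model, keep the docker run flags from Step 3 and change only the --model_name argument. The model repository below is just an illustrative choice; any model your Hugging Face token can access should work.

# Same docker run flags as in Step 3; only the trailing benchmark command changes
bash -c "python3 apps/benchmark.py \
    --model_name meta-llama/Llama-3.1-8B-Instruct \
    --input_len 128 --output_len 128"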