Inference Server

In addition to the In-Process SDK, LLMBoost provides an Inference Server, which is started with the llmboost command-line tool.

Deployment

There are two deployment options:

1. Single server (llmboost serve)

This runs a single application server for one model:

llmboost serve --model_path /models/models/Meta-Llama-3.1-8B-Instruct-FP8-KV --port 8011 --dp 8

2. Multi model deployment (llmboost deploy)

LLMBoost supports the deployment of multiple models on a multi-GPU server. To use this feature, you'll need a configuration file that specifies the deployment details for each model. The configuration options are consistent with those used in LLMBoost benchmark runs. An example configuration file is shown below:

common:
  kv_cache_dtype: auto
  host: 127.0.0.1

models:
  - model_name: Llama-3.2-1B-Instruct
    model_path: /models/models/Meta-Llama-3.1-8B-Instruct-FP8-KV
    port: 8011
    tp: 1
    dp: 2

  - model_name: Llama-3.1-70B-Instruct
    model_path: /models/models/Llama-3.1-70B-Instruct
    port: 8012
    tp: 2
    dp: 2
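
Each model in the configuration gets its own port and its own tp and dp values, so a quick check before deploying can catch port clashes. The sketch below is not part of LLMBoost. It mirrors the example configuration above as a Python list, and the helper name check_models is hypothetical. It also assumes that tp and dp are tensor-parallel and data-parallel degrees and that a model occupies tp * dp GPUs, which this page does not state.

```python
from collections import Counter

# Mirrors the "models" list from the example configuration above.
MODELS = [
    {"model_name": "Llama-3.2-1B-Instruct", "port": 8011, "tp": 1, "dp": 2},
    {"model_name": "Llama-3.1-70B-Instruct", "port": 8012, "tp": 2, "dp": 2},
]


def check_models(models):
    """Return (duplicate_ports, gpus_per_model) for a list of model entries.

    Assumption: each model uses tp * dp GPUs. This is an illustrative guess,
    not documented LLMBoost behavior.
    """
    port_counts = Counter(m["port"] for m in models)
    duplicate_ports = sorted(p for p, n in port_counts.items() if n > 1)
    gpus_per_model = {m["model_name"]: m["tp"] * m["dp"] for m in models}
    return duplicate_ports, gpus_per_model


if __name__ == "__main__":
    duplicates, gpus = check_models(MODELS)
    print("duplicate ports:", duplicates)
    print("assumed GPUs per model:", gpus)
    print("assumed total GPUs:", sum(gpus.values()))
```

If the duplicate list is non-empty, two models are configured for the same port and the file should be corrected before deploying.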

Finally, start the deployment by running:

llmboost deploy --config /workspace/examples/config/sample_config.yaml

Check deployment status

You can check the status of the LLMBoost instances by running llmboost status, which produces output similar to the following:

+------+------------------------+---------+
| Port | Name                   | Status  |
+------+------------------------+---------+
| 8011 | Llama-3.2-1B-Instruct  | running |
| 8012 | Llama-3.1-70B-Instruct | running |
+------+------------------------+---------+
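
For scripting, the status table can be turned into structured records. The following is a minimal sketch, assuming the real output keeps the border and row layout shown above. The function name parse_status_table is hypothetical, and the sample text is copied from this page, not captured from a live run.

```python
def parse_status_table(text):
    """Parse an ASCII table like the `llmboost status` output into a list of dicts.

    Assumption: rows look like "| a | b | c |" and borders start with "+".
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and blank lines
        rows.append([cell.strip() for cell in line.strip("|").split("|")])
    if not rows:
        return []
    header, *body = rows
    keys = [h.lower() for h in header]
    return [dict(zip(keys, row)) for row in body]


SAMPLE = """\
+------+------------------------+---------+
| Port | Name                   | Status  |
+------+------------------------+---------+
| 8011 | Llama-3.2-1B-Instruct  | running |
| 8012 | Llama-3.1-70B-Instruct | running |
+------+------------------------+---------+
"""

if __name__ == "__main__":
    for instance in parse_status_table(SAMPLE):
        print(instance["port"], instance["name"], instance["status"])
```

All values are returned as strings, so a port would need int(...) before numeric use.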

Run Server-side Benchmark

After deploying the models, you can run a benchmark on each model by executing the following command:

llmboost benchmark --port 8011 --num_prompts 100 --input_len 128 --output_len 128 --output_file /workspace/benchmarks/mi300/bench_result.csv 

This command benchmarks the server running on port 8011 by sending 100 prompts, each with an input length and an output length of 128 tokens, and saves the results to /workspace/benchmarks/mi300/bench_result.csv.
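
When several models are deployed, the same benchmark can be repeated per port. The sketch below only assembles the command line from the flags used in the command above. The helper name benchmark_command and the per-port result file names are hypothetical, and the commands are printed, not executed.

```python
import shlex


def benchmark_command(port, output_file, num_prompts=100, input_len=128, output_len=128):
    """Build the argument list for one `llmboost benchmark` run.

    Only flags that appear in the example command on this page are used.
    """
    return [
        "llmboost", "benchmark",
        "--port", str(port),
        "--num_prompts", str(num_prompts),
        "--input_len", str(input_len),
        "--output_len", str(output_len),
        "--output_file", output_file,
    ]


if __name__ == "__main__":
    # Ports come from the example deployment; the file names are illustrative.
    for port in (8011, 8012):
        cmd = benchmark_command(port, f"/workspace/benchmarks/mi300/bench_result_{port}.csv")
        # To run it, pass `cmd` to subprocess.run(cmd, check=True).
        print(shlex.join(cmd))
```

Building the command as a list avoids shell-quoting problems if a path ever contains spaces.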

Shutdown instance

Run llmboost shutdown --port XXXX to shut down a specific instance, or llmboost shutdown --all to shut down all instances on the current server.

Interactive client

Once you have an LLMBoost instance up and running, you can use llmboost client to connect to it.

llmboost client --port 8011
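
Before connecting, a script may want to confirm that an instance is actually listening. The following is a minimal sketch, assuming the instance accepts plain TCP connections on its port and is reachable at 127.0.0.1 as in the example configuration. The function name wait_for_port is hypothetical and the check is not an LLMBoost feature.

```python
import socket
import time


def wait_for_port(port, host="127.0.0.1", timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    if wait_for_port(8011, timeout=5.0):
        print("port 8011 is accepting connections; try: llmboost client --port 8011")
    else:
        print("nothing is listening on port 8011 yet")
```

A successful connection only shows that something is listening on the port; it does not confirm that the model has finished loading.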