llmboost

LLMBoost Objects

class LLMBoost()

LLMBoost is a ready-to-deploy, full-stack AI inference server offering unprecedented performance, cost efficiency, and flexibility.

This class is used to interact with the LLMBoost Runtime. It provides methods to start and stop the runtime, issue inputs, get outputs, set the query type, and format inputs.

__init__

def __init__(model_name,
query_type="text",
tp=0,
dp=0,
max_num_seqs=256,
max_model_len=None,
load_format="auto",
quantization=None,
kv_cache_dtype="auto",
quantization_param_path=None,
quantization_weight_path=None,
enable_async_output=False,
streaming=False,
beam_width=0,
max_tokens=1024,
drain_per_worker=False,
load_balancing_mode="auto",
first_gpu_id=0,
vllm_args={})

Initialize the LLMBoost Runtime.

Arguments:

  • model_name str - The name of the model to be used. This can be a model from the Hugging Face Model Hub or a custom model.
  • query_type str, List[Int] - The type of query to be used. This can be either "text" or "tokens".
  • tp int - Tensor parallelism degree.
  • dp int - Data parallelism degree.
  • max_num_seqs int - Maximum number of sequences to be processed in parallel.
  • max_model_len int - Maximum model context length, in tokens.
  • load_format str - The format in which the model is loaded. The default is "auto" but you can specify either "pt" or "safetensor".
  • quantization str - The quantization method to be used. The default is None but you can specify "fp8" to use floating point 8-bit quantization.
  • kv_cache_dtype str - The data type to be used for the key-value cache. The default is "auto" but you can specify "fp8" to store the cache in 8-bit floating point.
  • quantization_param_path str - The path to the quantization parameters.
  • quantization_weight_path str - The path to the quantization weights.
  • enable_async_output bool - This enables asyncio support for the output queue.
  • streaming bool - This enables streaming mode.
  • beam_width int - The beam width to be used for beam search.
  • max_tokens int - The maximum number of tokens to be generated.
  • drain_per_worker bool - This enables draining per worker. It can improve latency in rare cases but is not recommended.
  • load_balancing_mode str - The load balancing mode to be used. The default is "auto" but you can specify "passthrough" or "scatter".
  • first_gpu_id int - The first GPU ID to be used.
  • vllm_args dict - Additional arguments to be passed to vLLM.

Example:

Create an LLMBoost object with the Llama-3.1-70B-Instruct model and default parameters.

llm = LLMBoost(model_name="meta-llama/Llama-3.1-70B-Instruct")
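
A sketch of a non-default configuration using only the parameters documented above; the values are illustrative, not tuned recommendations.

llm = LLMBoost(
    model_name="meta-llama/Llama-3.1-70B-Instruct",
    tp=4,              # tensor parallelism degree
    dp=2,              # data parallelism degree
    streaming=True,    # stream partial outputs as they are generated
    max_tokens=512,    # cap the number of generated tokens per request
)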

start

def start()

Start the LLMBoost Runtime.

Example:

Start the LLMBoost Runtime.

llm.start()

stop

def stop()

Stop the LLMBoost Runtime.

Example:

Stop the LLMBoost Runtime.

llm.stop()
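
A common pattern (a sketch, not a requirement of the API) is to pair start and stop with try/finally so the runtime is shut down even if request handling raises an exception.

llm.start()
try:
    # ... issue inputs and collect outputs here ...
    pass
finally:
    llm.stop()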

aget_output

async def aget_output()

Asyncio-compatible version of get_output.

Returns:

  • Dict - The output.

Example:

Get the output from the LLMBoost Runtime.

out = await llm.aget_output()
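
A minimal sketch of collecting outputs with asyncio, assuming the runtime was created with enable_async_output=True (see __init__), has been started, and has already been issued num_requests inputs. The collection logic mirrors the streaming example under get_output.

import asyncio

async def collect(llm, num_requests):
    # Accumulate streamed text per request id until every request reports finished.
    preds = {}
    done = 0
    while done < num_requests:
        o = await llm.aget_output()
        preds[o["id"]] = preds.get(o["id"], "") + o["val"]
        if o["finished"]:
            done += 1
    return preds

preds = asyncio.run(collect(llm, num_requests=2))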

get_output

def get_output(num=int(1e9))

Get the output from the LLMBoost Runtime. The output dictionary is in the format:

{
    "id": int,
    "val": str,
    "finished": dict
}

Arguments:

  • num int - The number of outputs to get.

Returns:

  • List[Dict] - The list of outputs.

Example:

Get the output from the LLMBoost Runtime.

out = llm.get_output()

for o in out:
    # Is the request finished?
    print(o["finished"])

    # Print the id
    print(o["id"])

    # Print the predicted text
    print(o["val"])

This is a simple example of how one would use the streaming outputs:

preds = {}

out_count = 0
while out_count < len(inputs):
    outputs = llm.get_output()
    for o in outputs:
        preds[o["id"]] = preds.get(o["id"], "") + o["val"]
        out_count += o["finished"]
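
For context, a minimal sketch of the surrounding flow that produces the inputs consumed by the loop above. It uses only the methods documented on this page; the prompts are illustrative.

llm = LLMBoost(model_name="meta-llama/Llama-3.1-70B-Instruct", streaming=True)
llm.start()

# apply_format returns {"id", "val"} dictionaries, which is the same
# shape that issues_inputs accepts (see the sections below).
inputs = [
    llm.apply_format("What is the largest planet in the solar system?"),
    llm.apply_format("Which planet is closest to the sun?"),
]
llm.issues_inputs(inputs)

# ...run the collection loop above, then shut down...
llm.stop()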

issues_inputs

def issues_inputs(inputs)

Issue inputs to the LLMBoost Runtime.

The input dictionary is in the format:

{
    "id": int,
    "val": str
}

Arguments:

  • inputs Dict, List[Dict] - The inputs to be issued.

Example:

Issue inputs to the LLMBoost Runtime.

text = "What is the largest planet in the solar system?"
llm.issues_inputs({"id": 1, "val": text})
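
Because inputs can also be a list of dictionaries, a batch of requests can be issued in a single call. A short sketch with illustrative prompts and explicitly assigned ids:

prompts = [
    "What is the largest planet in the solar system?",
    "Which planet is closest to the sun?",
]
# Give each request a unique id so its outputs can be matched up later.
llm.issues_inputs([{"id": i, "val": p} for i, p in enumerate(prompts)])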

apply_format

def apply_format(chat)

Apply the model specific format to the chat.

The input chat dictionary is in the format:

{
    "role": str,
    "content": str
}

The output dictionary is in the format:

{
    "id": int,
    "val": str
}

Arguments:

  • chat str, Dict, List[Dict] - The chat to be formatted.

Returns:

  • Dict - The formatted chat.

Example:

Apply the model specific format to the chat.

llm = LLMBoost(model_name="meta-llama/Llama-3.1-70B-Instruct")
formatted_chat = llm.apply_format(
    [
        {'role': 'system', 'content': 'You are a chatbot.'},
        {'role': 'user', 'content': 'What is the largest planet in the solar system?'}
    ]
)

print(formatted_chat)
>>> {'id': 123456789, 'val': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>You are a chatbot.<|eot_id|><|start_header_id|>user<|end_header_id|>What is the largest planet in the solar system?<|eot_id|><|start_header_id|>assistant<|end_header_id|>'}

set_query_type

def set_query_type(query_type="tokens")

Set the query type used by the LLMBoost Runtime.

Arguments:

  • query_type str - The query type to be used. This can be either "text" or "tokens".

Example:

Set the query type to "tokens".

llm.set_query_type("tokens")
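
A sketch of switching to token-based queries. The Hugging Face tokenizer and the assumption that "val" carries a list of token IDs in tokens mode are illustrative, not part of the documented API.

from transformers import AutoTokenizer

# Assumption: in "tokens" mode the "val" field holds a list of token IDs.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
llm.set_query_type("tokens")

token_ids = tok.encode("What is the largest planet in the solar system?")
llm.issues_inputs({"id": 1, "val": token_ids})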