# Start K8s Cluster
LLMBoost supports scalable multi-node deployment using Kubernetes (K8s). This guide walks you through configuring and launching a Kubernetes-based LLMBoost cluster using the built-in auto-deployment tooling.
💡 Kubernetes and container orchestration will be auto-configured by the deployment scripts. You do not need to pre-install Kubernetes manually on each node.
## Prerequisites
Before running the deployment script, ensure the following are installed on the machine that will control the deployment:
- The LLMBoost Docker container
- Access to the `/workspace/cluster` directory inside the LLMBoost Docker container
- SSH access (with password-less login) to all target nodes (see the sketch below if this is not yet set up)
- The SSH key must not be password-protected
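If password-less SSH is not yet configured, one common way to set it up is shown below. This is a sketch, not part of the LLMBoost tooling; the username and IPs are taken from the example `config.json` later in this guide, so substitute your own values.

```bash
# Generate a key without a passphrase (skip if ~/.ssh/id_rsa already exists).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key to every target node (example user/IPs from the config below).
ssh-copy-id -i ~/.ssh/id_rsa.pub suyoung.choi@10.4.16.1
ssh-copy-id -i ~/.ssh/id_rsa.pub suyoung.choi@10.4.16.2

# Verify that login works without a password prompt.
ssh -i ~/.ssh/id_rsa suyoung.choi@10.4.16.1 "hostname"
```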
## Step 1: Configure Cluster Settings (`config.json`)
Navigate to the `/workspace/cluster` directory and edit `config.json`.
This file defines model settings, node topology, and deployment roles. Below are the required fields:
| Field | Description |
|---|---|
| `vars.hf_token` | Your Hugging Face token for downloading models |
| `vars.model_name` | The model to load (e.g., `"meta-llama/Llama-3.2-1B-Instruct"`) |
| `node_config.nodes.<name>.private_ip` | Internal IP used for inter-node communication |
| `node_config.nodes.<name>.public_ip` | Public IP or hostname for SSH access |
| `node_config.common_ssh_option` | Global SSH options (e.g., `User`, `IdentityFile`) used by all nodes |
| `manager_node` | Name of the control-plane node from `nodes` |
| `worker_nodes` | List of node names (can include `manager_node`) to deploy LLMBoost workers |
The following is an example `config.json` file:
```json
{
  "vars": {
    "hf_token": "<PUT-YOUR-HF-TOKEN-HERE>",
    "model_name": "meta-llama/Llama-3.2-1B-Instruct"
  },
  "node_config": {
    "common_ssh_option": {
      "User": "suyoung.choi",
      "IdentityFile": "~/.ssh/id_rsa"
    },
    "nodes": {
      "node-0": {
        "private_ip": "10.4.16.1",
        "public_ip": "10.4.16.1"
      },
      "node-1": {
        "private_ip": "10.4.16.2",
        "public_ip": "10.4.16.2"
      }
    },
    "manager_node": "node-0",
    "worker_nodes": [
      "node-0",
      "node-1"
    ]
  }
}
```
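Before deploying, you can sanity-check the file with a short script like the one below. This is an illustrative helper, not part of the LLMBoost tooling; the field names match the table above.

```python
import json

# Illustrative sanity check for config.json; not part of the LLMBoost tooling.
with open("/workspace/cluster/config.json") as f:
    cfg = json.load(f)

nodes = cfg["node_config"]["nodes"]

# The manager and every worker must be defined under "nodes".
assert cfg["node_config"]["manager_node"] in nodes, "manager_node not found in nodes"
for name in cfg["node_config"]["worker_nodes"]:
    assert name in nodes, f"worker node {name!r} not found in nodes"

# Each node needs both IP fields, and the HF token must be filled in.
for name, node in nodes.items():
    assert node.get("private_ip") and node.get("public_ip"), f"{name} is missing an IP"
assert not cfg["vars"]["hf_token"].startswith("<"), "hf_token still has the placeholder value"

print("config.json looks consistent")
```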
## Step 2: Deploying the Cluster
Once SSH access and the `config.json` file are ready, navigate to the `/workspace/cluster` directory and run:

```bash
make install
```

This sets up Kubernetes, installs dependencies, and initializes the manager and worker nodes. Then deploy the LLMBoost services across the cluster:

```bash
make deploy
```
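To confirm that the cluster came up, you can check node and pod status from the manager node. This assumes `kubectl` is available there after `make install`; the exact pod names depend on your deployment.

```bash
# All nodes should report Ready once installation has finished.
kubectl get nodes -o wide

# List the pods across all namespaces and wait for the LLMBoost pods to reach Running.
kubectl get pods -A
```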
## Step 3: Accessing the LLMBoost Server
Once the cluster is deployed, the LLMBoost inference server will be exposed on port `30080` of each worker node.

You can connect to it from any machine with network access to the nodes using the CLI:

```bash
llmboost client --host <worker-node-ip> --port 30080
```

Replace `<worker-node-ip>` with the public IP of any active worker node.
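Because the server exposes an OpenAI-compatible `/v1` API (as the Python client below uses), you can also send a quick smoke-test request with `curl`. Adjust the IP to one of your worker nodes and use the model name from your `config.json`.

```bash
# Minimal smoke test against one worker node; adjust the IP to your deployment.
curl http://10.4.16.1:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50
      }'
```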
You can use the sample Python client below to distribute requests across multiple nodes. This is useful for benchmarking or testing your deployment in parallel.
Sample Python Client (`client.py`):
```python
import argparse
import threading
from queue import Queue, Empty

from openai import OpenAI

# Prompts to distribute across the worker nodes.
prompts = [
    "how does multithreading work in Python?",
    "Write me a Fibonacci generator in Python",
    "Which pet should I get, dog or cat?",
    "How do I fine-tune an LLM model?",
]

# Thread worker: each thread talks to one host and drains the shared prompt queue.
def run_thread(host, queue: Queue):
    client = OpenAI(
        base_url=f"http://{host}/v1",
        api_key="-",  # Placeholder; the OpenAI client requires a non-empty key.
    )
    while True:
        try:
            prompt = queue.get_nowait()
        except Empty:
            break  # No prompts left to process.
        chat_completion = client.chat.completions.create(
            model="meta-llama/Llama-3.2-1B-Instruct",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            stream=False,
            max_tokens=1000,
        )
        print(
            f"-------------------------------------------------------------------\n"
            f"Question: {prompt}\nAnswer: {chat_completion.choices[0].message.content}"
        )

# Argument parsing
parser = argparse.ArgumentParser()
parser.add_argument("--hosts", nargs="+", help="list of servers <server_ip>:<port>")
args = parser.parse_args()

threads = []
queue = Queue()

# Populate the request queue.
for prompt in prompts:
    queue.put(prompt)

# Launch one thread per host; all threads share the same queue.
for host in args.hosts:
    t = threading.Thread(target=run_thread, args=(host, queue))
    threads.append(t)
    t.start()

# Wait for all threads to complete.
for thread in threads:
    thread.join()
```
To run the client:

```bash
python client.py --hosts 10.4.16.1:30080 10.4.16.2:30080
```
This command distributes the prompts across the specified worker nodes, demonstrating LLMBoost's ability to scale inference over a cluster.
## Load Balancing (Current Status)
ℹ️ Note: Automatic load balancing is not yet enabled. You must explicitly distribute requests across worker node IPs (as done in the client script above). A load balancer or proxy (like NGINX) can be added in front for automated request routing, as sketched below.
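As one possible approach, a minimal NGINX configuration that round-robins requests across the two example worker nodes might look like the following. This is an illustrative sketch, not something generated by the LLMBoost deployment; adjust the IPs and listen port for your environment.

```nginx
# Illustrative NGINX reverse-proxy config; not generated by the LLMBoost tooling.
upstream llmboost_workers {
    # Default round-robin across the worker nodes from the example config.
    server 10.4.16.1:30080;
    server 10.4.16.2:30080;
}

server {
    listen 8080;

    location / {
        proxy_pass http://llmboost_workers;
    }
}
```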