
Start K8s Cluster

LLMBoost supports scalable multi-node deployment using Kubernetes (K8s). This guide walks you through configuring and launching a Kubernetes-based LLMBoost cluster using the built-in auto-deployment tooling.

💡 Kubernetes and container orchestration will be auto-configured by the deployment scripts. You do not need to pre-install Kubernetes manually on each node.

Prerequisites

Before running the deployment scripts, make sure the following are in place on the machine that will control the deployment:

  • The LLMBoost Docker container
  • Access to the /workspace/cluster directory inside the LLMBoost Docker container
  • Password-less SSH access to all target nodes (see the note below this list)
  • An SSH key that is not passphrase-protected
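
💡 If password-less login is not set up yet, copying your public key to each node is the usual way to enable it (this assumes a key pair already exists at ~/.ssh/id_rsa):

ssh-copy-id -i ~/.ssh/id_rsa.pub <user>@<node-public-ip>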

Step 1: Configure Cluster Settings (config.json)

Navigate to the /workspace/cluster directory and edit config.json.

This file defines model settings, node topology, and deployment roles. Below are the required fields:

  • vars.hf_token: Your HuggingFace token, used to download models
  • vars.model_name: The model to load (e.g., "meta-llama/Llama-3.2-1B-Instruct")
  • node_config.nodes.<name>.private_ip: Internal IP used for communication between nodes in the cluster
  • node_config.nodes.<name>.public_ip: Public IP or hostname used for SSH access
  • node_config.common_ssh_option: Global SSH options (e.g., User, IdentityFile) applied to all nodes
  • manager_node: Name of the control-plane node; must be one of the entries in nodes
  • worker_nodes: List of node names (may include manager_node) on which LLMBoost workers are deployed

The following is an example config.json file.

{
  "vars": {
    "hf_token": "<PUT-YOUR-HF-TOKEN-HERE>",
    "model_name": "meta-llama/Llama-3.2-1B-Instruct"
  },
  "node_config": {
    "common_ssh_option": {
      "User": "suyoung.choi",
      "IdentityFile": "~/.ssh/id_rsa"
    },
    "nodes": {
      "node-0": {
        "private_ip": "10.4.16.1",
        "public_ip": "10.4.16.1"
      },
      "node-1": {
        "private_ip": "10.4.16.2",
        "public_ip": "10.4.16.2"
      }
    },
    "manager_node": "node-0",
    "worker_nodes": [
      "node-0",
      "node-1"
    ]
  }
}
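
Before deploying, it can help to sanity-check the file. The short Python sketch below is not part of the LLMBoost tooling; it assumes it is run from the directory containing config.json and that an ssh client is available on the controlling machine. It checks for the required fields and tries a password-less SSH login to each node's public_ip:

import json
import subprocess

# Load the cluster configuration written in Step 1.
with open("config.json") as f:
    cfg = json.load(f)

# Required fields from the list above.
assert "hf_token" in cfg["vars"] and "model_name" in cfg["vars"]
nc = cfg["node_config"]
assert nc["manager_node"] in nc["nodes"]
assert all(w in nc["nodes"] for w in nc["worker_nodes"])

# BatchMode=yes makes ssh fail instead of prompting for a password.
user = nc["common_ssh_option"].get("User")
for name, node in nc["nodes"].items():
    target = f"{user}@{node['public_ip']}" if user else node["public_ip"]
    ok = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", target, "true"],
        capture_output=True,
    ).returncode == 0
    print(f"{name}: {'OK' if ok else 'SSH FAILED'}")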

Step 2: Deploying the Cluster

Once SSH access and the config.json file are ready, navigate to the /workspace/cluster directory and run:

make install  

This sets up Kubernetes, installs dependencies, and initializes the manager and worker nodes. Then deploy the LLMBoost services across the cluster:

make deploy    
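
To confirm the cluster came up, you can run standard Kubernetes commands from the manager node (these are generic kubectl checks, not LLMBoost-specific ones):

kubectl get nodes
kubectl get pods -A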

Step 3: Accessing the LLMBoost Server

Once the cluster is deployed, the LLMBoost inference server will be exposed on port 30080 of each worker node.

You can connect to it from any machine with network access to the node using the CLI:

llmboost client --host <worker-node-ip> --port 30080

Replace <worker-node-ip> with the public IP of any active worker node.
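
As a quick smoke test, you can also call the OpenAI-compatible chat endpoint directly with curl (the same /v1 route that the sample client below uses; the exact response fields depend on the server):

curl http://<worker-node-ip>:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.2-1B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'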

Need to test multiple nodes at once?

You can use the sample Python client below to distribute requests across multiple nodes. This is useful for benchmarking or testing your deployment in parallel.

Sample Python Client (client.py)

import argparse
import threading
from queue import Queue, Empty

from openai import OpenAI

# Define prompt list
prompts = [
    "How does multithreading work in Python?",
    "Write me a Fibonacci generator in Python",
    "Which pet should I get, dog or cat?",
    "How do I fine-tune an LLM model?",
]

# Thread worker: drains prompts from the shared queue and sends each to its host
def run_thread(host, queue: Queue):
    client = OpenAI(
        base_url=f"http://{host}/v1",
        api_key="-",
    )
    while True:
        try:
            # get_nowait avoids blocking if another thread took the last prompt
            prompt = queue.get_nowait()
        except Empty:
            break
        chat_completion = client.chat.completions.create(
            model="meta-llama/Llama-3.2-1B-Instruct",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            stream=False,
            max_tokens=1000,
        )
        print(
            f"-------------------------------------------------------------------\n"
            f"Question: {prompt}\nAnswer: {chat_completion.choices[0].message.content}"
        )

# Argument parsing
parser = argparse.ArgumentParser()
parser.add_argument("--hosts", nargs="+", help="list of servers <server_ip>:<port>")
args = parser.parse_args()

threads = []
queue = Queue()

# Populate the request queue
for prompt in prompts:
    queue.put(prompt)

# Launch one worker thread per host
for host in args.hosts:
    t = threading.Thread(target=run_thread, args=(host, queue))
    threads.append(t)
    t.start()

# Wait for all threads to complete
for thread in threads:
    thread.join()

To run the client:

python client.py --hosts 10.4.16.1:30080 10.4.16.2:30080

This command distributes the prompts across the specified worker nodes, demonstrating LLMBoost's ability to scale inference over a cluster.

Load Balancing (Current Status)

ℹ️ Note: Automatic load balancing is not yet enabled. You must explicitly distribute requests across worker node IPs (as done in the client script above). A load balancer or proxy (like NGINX) can be added in front for automated request routing.
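
In the meantime, a simple client-side option (a sketch only, reusing the worker endpoints from the example config above and the same OpenAI-compatible API as the sample client) is to rotate requests across the endpoints yourself:

import itertools

from openai import OpenAI

# Worker endpoints from config.json; substitute your own <ip>:30080 pairs.
hosts = ["10.4.16.1:30080", "10.4.16.2:30080"]
clients = itertools.cycle([OpenAI(base_url=f"http://{h}/v1", api_key="-") for h in hosts])

def ask(prompt: str) -> str:
    # Each call goes to the next worker endpoint, round-robin.
    client = next(clients)
    resp = client.chat.completions.create(
        model="meta-llama/Llama-3.2-1B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return resp.choices[0].message.content

print(ask("What is Kubernetes?"))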