# Start K8s Cluster
LLMBoost supports scalable multi-node deployment using Kubernetes (K8s). This guide walks you through configuring and launching a Kubernetes-based LLMBoost cluster using the built-in auto-deployment tooling.
💡 Kubernetes and container orchestration will be auto-configured by the deployment scripts. You do not need to pre-install Kubernetes manually on each node.
## Prerequisites
Before running the deployment script, ensure the following are installed on the machine that will control the deployment:
- The LLMBoost Docker container
- Access to the `/workspace/cluster` directory inside the LLMBoost Docker container
- SSH access (with password-less login) to all target nodes (see the sketch below if this is not yet set up)
- The SSH key must not be password-protected
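If password-less SSH is not yet configured, one common way to set it up is shown below. This is a sketch, not part of the LLMBoost tooling; the username and IPs are taken from the example `config.json` later in this guide, so substitute your own values.

```bash
# Generate a key without a passphrase (skip if ~/.ssh/id_rsa already exists).
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key to every target node (example user/IPs from the config below).
ssh-copy-id -i ~/.ssh/id_rsa.pub suyoung.choi@10.4.16.1
ssh-copy-id -i ~/.ssh/id_rsa.pub suyoung.choi@10.4.16.2

# Verify that login works without a password prompt.
ssh -i ~/.ssh/id_rsa suyoung.choi@10.4.16.1 "hostname"
```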
## Step 1: Configure Cluster Settings (`config.json`)
Navigate to the `/workspace/cluster` directory and edit `config.json`.
This file defines model settings, node topology, and deployment roles. Below are the required fields:
| Field | Description |
|---|---|
| `vars.hf_token` | Your Hugging Face token for downloading models |
| `vars.model_name` | The model to load (e.g., `"meta-llama/Llama-3.2-1B-Instruct"`) |
| `node_config.nodes.<name>.private_ip` | Internal IP used for inter-node communication |
| `node_config.nodes.<name>.public_ip` | Public IP or hostname for SSH access |
| `node_config.common_ssh_option` | Global SSH options (e.g., `User`, `IdentityFile`) used by all nodes |
| `manager_node` | Name of the control-plane node from `nodes` |
| `worker_nodes` | List of node names (can include `manager_node`) to deploy LLMBoost workers |
The following is an example `config.json` file:
```json
{
  "vars": {
    "hf_token": "<PUT-YOUR-HF-TOKEN-HERE>",
    "model_name": "meta-llama/Llama-3.2-1B-Instruct"
  },
  "node_config": {
    "common_ssh_option": {
      "User": "suyoung.choi",
      "IdentityFile": "~/.ssh/id_rsa"
    },
    "nodes": {
      "node-0": {
        "private_ip": "10.4.16.1",
        "public_ip": "10.4.16.1"
      },
      "node-1": {
        "private_ip": "10.4.16.2",
        "public_ip": "10.4.16.2"
      }
    },
    "manager_node": "node-0",
    "worker_nodes": [
      "node-0",
      "node-1"
    ]
  }
}
```
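Before deploying, you can sanity-check the file with a short script like the one below. This is an illustrative helper, not part of the LLMBoost tooling; the field names match the table above.

```python
import json

# Illustrative sanity check for config.json; not part of the LLMBoost tooling.
with open("/workspace/cluster/config.json") as f:
    cfg = json.load(f)

nodes = cfg["node_config"]["nodes"]

# The manager and every worker must be defined under "nodes".
assert cfg["node_config"]["manager_node"] in nodes, "manager_node not found in nodes"
for name in cfg["node_config"]["worker_nodes"]:
    assert name in nodes, f"worker node {name!r} not found in nodes"

# Each node needs both IP fields, and the HF token must be filled in.
for name, node in nodes.items():
    assert node.get("private_ip") and node.get("public_ip"), f"{name} is missing an IP"
assert not cfg["vars"]["hf_token"].startswith("<"), "hf_token still has the placeholder value"

print("config.json looks consistent")
```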
## Step 2: Deploying the Cluster
Once SSH access and the `config.json` file are ready, navigate to the `/workspace/cluster` directory and run:

```bash
make install
```

This sets up Kubernetes, installs dependencies, and initializes the manager and worker nodes. Then deploy the LLMBoost services across the cluster:

```bash
make deploy
```
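To confirm that the cluster came up, you can check node and pod status from the manager node. This assumes `kubectl` is available there after `make install`; the exact pod names depend on your deployment.

```bash
# All nodes should report Ready once installation has finished.
kubectl get nodes -o wide

# List the pods across all namespaces and wait for the LLMBoost pods to reach Running.
kubectl get pods -A
```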
## Step 3: Accessing the LLMBoost Server
Once the cluster is deployed, the LLMBoost inference server will be exposed on port `30080` of each worker node.

You can connect to it from any machine with network access to the nodes using the CLI:

```bash
llmboost client --host <worker-node-ip> --port 30080
```

Replace `<worker-node-ip>` with the public IP of any active worker node.
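Because the server exposes an OpenAI-compatible `/v1` API (as the Python client below uses), you can also send a quick smoke-test request with `curl`. Adjust the IP to one of your worker nodes and use the model name from your `config.json`.

```bash
# Minimal smoke test against one worker node; adjust the IP to your deployment.
curl http://10.4.16.1:30080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50
      }'
```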
You can use the sample Python client below to distribute requests across multiple nodes. This is useful for benchmarking or testing your deployment in parallel.
Sample Python Client (`client.py`):
```python
import argparse
import threading
from queue import Queue, Empty

from openai import OpenAI

# Prompts to distribute across the worker nodes.
prompts = [
    "how does multithreading work in Python?",
    "Write me a Fibonacci generator in Python",
    "Which pet should I get, dog or cat?",
    "How do I fine-tune an LLM model?",
]

# Thread worker: each thread talks to one host and drains the shared prompt queue.
def run_thread(host, queue: Queue):
    client = OpenAI(
        base_url=f"http://{host}/v1",
        api_key="-",  # Placeholder; the OpenAI client requires a non-empty key.
    )
    while True:
        try:
            prompt = queue.get_nowait()
        except Empty:
            break  # No prompts left to process.
        chat_completion = client.chat.completions.create(
            model="meta-llama/Llama-3.2-1B-Instruct",
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            stream=False,
            max_tokens=1000,
        )
        print(
            f"-------------------------------------------------------------------\n"
            f"Question: {prompt}\nAnswer: {chat_completion.choices[0].message.content}"
        )

# Argument parsing
parser = argparse.ArgumentParser()
parser.add_argument("--hosts", nargs="+", help="list of servers <server_ip>:<port>")
args = parser.parse_args()

threads = []
queue = Queue()

# Populate the request queue.
for prompt in prompts:
    queue.put(prompt)

# Launch one thread per host; all threads share the same queue.
for host in args.hosts:
    t = threading.Thread(target=run_thread, args=(host, queue))
    threads.append(t)
    t.start()

# Wait for all threads to complete.
for thread in threads:
    thread.join()
```
To run the client:

```bash
python client.py --hosts 10.4.16.1:30080 10.4.16.2:30080
```
This command distributes the prompts across the specified worker nodes, demonstrating LLMBoost's ability to scale inference over a cluster.
## Load Balancing (Current Status)
ℹ️ Note: Automatic load balancing is not yet enabled. You must explicitly distribute requests across worker node IPs (as done in the client script above). A load balancer or proxy (like NGINX) can be added in front for automated request routing, as sketched below.
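As one possible approach, a minimal NGINX configuration that round-robins requests across the two example worker nodes might look like the following. This is an illustrative sketch, not something generated by the LLMBoost deployment; adjust the IPs and listen port for your environment.

```nginx
# Illustrative NGINX reverse-proxy config; not generated by the LLMBoost tooling.
upstream llmboost_workers {
    # Default round-robin across the worker nodes from the example config.
    server 10.4.16.1:30080;
    server 10.4.16.2:30080;
}

server {
    listen 8080;

    location / {
        proxy_pass http://llmboost_workers;
    }
}
```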