Deploying Retrieval-Augmented Generation

LLMBoost supports Retrieval-Augmented Generation (RAG) on top of its standard LLM inference endpoint. With a simple llmboost serve command, LLMBoost automatically indexes your specified documents into a vector database, sets up the RAG generation pipelines, and serves RAG on the specified endpoint. This tutorial walks through setting up and querying the LLMBoost RAG pipelines with an example that uses medical datasets.

Step 0: Before you start

First, set up directories for the RAG dataset and vector database storage with the following commands. The /mnt/data/MedRAG folder will contain the dataset used in this example, and the /mnt/data/milvus folder provides storage for the vector database. Later, in the RAG deep dive, we will demonstrate how to set up your own custom datasets for your RAG application.

sudo mkdir -p /mnt/data
sudo mkdir -p /mnt/data/MedRAG
sudo mkdir -p /mnt/data/milvus
sudo chmod 777 /mnt/data
sudo chmod 777 /mnt/data/MedRAG
sudo chmod 777 /mnt/data/milvus

Now, start the LLMBoost container by setting the relevant environment variables and running a docker run command. This container is where the LLMBoost RAG servers will run.

export MODEL_PATH=<absolute_path_to_model_directory>
export LICENSE_FILE=<absolute_path_to_license_file>
export HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>

💡 These variables are used when launching the Docker container to ensure correct model loading and authentication.

  • Set the model directory MODEL_PATH with the absolute path to the directory on your host file system where your local models are stored.
  • Set the license file path LICENSE_FILE to your license file location. Please contact us through [email protected] if you don't have an LLMBoost license.
  • Set the HuggingFace token HUGGING_FACE_HUB_TOKEN by obtaining a Hugging Face token from huggingface.co/settings/tokens.

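For example, assuming the models are stored under /data/models and the license file is at /data/llmboost_license.skm (hypothetical paths shown only for illustration), the exports might look like:

export MODEL_PATH=/data/models
export LICENSE_FILE=/data/llmboost_license.skm
export HUGGING_FACE_HUB_TOKEN=hf_xxxxxxxxxxxxxxxxxxxx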

docker run -it --rm \
--network host \
--gpus all \
--pid=host \
--group-add video \
--ipc host \
--cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $MODEL_PATH:/workspace/models \
-v $LICENSE_FILE:/workspace/llmboost_license.skm \
-v /mnt/data:/mnt/data \
-w /workspace \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
<llmboost-docker-image-name>:prod-cuda \
bash

Note: Replace <llmboost-docker-image-name> with the image name provided by MangoBoost.
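
Once inside the container, you can optionally confirm that the RAG assets referenced in this tutorial are present (a quick sanity check using the paths that appear throughout this guide):

# Inside the container: list the RAG scripts and configuration files
ls /workspace/rag/run_milvus_standalone.sh /workspace/rag/download.py /workspace/rag/configs/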

Next, copy two scripts from the LLMBoost container to your host and run them. The first script (run_milvus_standalone.sh) starts the vector database. The second script (download.py) downloads the RAG document dataset. Run these commands on the host from the /mnt/data directory you created earlier.

# Navigate to /mnt/data
cd /mnt/data

# From your host machine, find the container ID of the LLMBoost container
docker ps
export DOCKER_ID=<llmboost_container_id>

# Copy the scripts from the LLMBoost container to your host machine
docker cp ${DOCKER_ID}:/workspace/rag/run_milvus_standalone.sh run_milvus_standalone.sh
docker cp ${DOCKER_ID}:/workspace/rag/download.py download.py

# Run the start script to start the RAG vector database
bash run_milvus_standalone.sh start /mnt/data/milvus

# Run the download script on the host to download the document dataset
huggingface-cli login
python3 download.py --parent_dir /mnt/data/MedRAG --textbooks
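
Before moving on, it is worth verifying that the Milvus vector database actually came up. A simple check is to look for the Milvus containers; Milvus standalone also typically exposes a health endpoint on port 9091 (assumed default here, adjust if your deployment differs):

# Check that the Milvus containers are running
docker ps | grep milvus

# Optional: query the Milvus standalone health endpoint (assumes the default port 9091)
curl -s http://127.0.0.1:9091/healthz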

Next, clone the MIRAGE question dataset into /mnt/data if it has not been cloned already. Run these commands on the host from the /mnt/data directory you created earlier.

# Navigate to /mnt/data
cd /mnt/data

# Clone the question dataset
git clone https://github.com/Teddy-XiongGZ/MIRAGE.git

Now, the document dataset and question dataset should be available on your host machine.
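
A quick listing confirms that both datasets are in place (the exact file layout under MedRAG depends on the download script):

ls /mnt/data/MedRAG /mnt/data/MIRAGE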

Step 1: Start the RAG service

The following command will start both the LLMBoost inference endpoint and the RAG query endpoint. The command will also automatically index the example textbook dataset into the Milvus vector database before bringing up the RAG query endpoint. Run these commands in the container from the /workspace directory.

# Navigate to /workspace
cd /workspace

# For a node with AMD GPUs
llmboost serve --rag \
--rag_config_path /workspace/rag/configs/rocm.yaml \
--model_path /workspace/models/Llama-3.1-8B-Instruct \
--hf_token <your_hf_token> \
--tp 1 \
--dp 8
# Or, for a node with NVIDIA GPUs
llmboost serve --rag \
--rag_config_path /workspace/rag/configs/cuda.yaml \
--model_path /workspace/models/Llama-3.2-1B-Instruct \
--hf_token <your_hf_token> \
--tp 1 \
--dp 4
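
Indexing the example dataset happens before the endpoints come up, so the server can take a while to become ready. One way to wait for it is to poll the /status endpoint used in Step 2 below (a simple sketch; adjust the port if you changed the defaults):

# Poll the inference endpoint until the server reports that it is running
until curl -s http://127.0.0.1:8011/status | grep -q running; do
  echo "Waiting for the LLMBoost server to come up..."
  sleep 10
done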

These commands rely on RAG configuration files such as /workspace/rag/configs/rocm.yaml, which is shown below.

rag_embedding_model: sentence-transformers/all-mpnet-base-v2
gpu_backend: rocm
num_gpus: 8

document_split_by: period
document_split_length: 5
document_split_overlap: 0
document_split_threshold: 2

drop_old: true
random_seed: 42

max_num_indexing_processes: 8
max_num_querying_processes: 8
max_num_indexing_threads: 1
max_num_querying_threads: 1

indexing_batch_size: 128
querying_batch_size: 128

indexing_queue_size: 1024
querying_queue_size: 1024

indexing_cpu_only: false
querying_cpu_only: false

prompt_format: default_prompt

Each field of the RAG configuration file controls an aspect of how the RAG server operates; the fields are described below.

  • rag_embedding_model : select which embedding model from HuggingFace to use
  • gpu_backend : select which GPU backend to use for the embedding models (cuda, rocm)
  • num_gpus : specify the number of GPUs in your system
  • document_split_by : decide how to split your documents into chunks (word, period, line)
  • document_split_length : the length of each document chunk, in units of document_split_by
  • document_split_overlap : decide how much overlap adjacent document chunks should have with each other, in units of document_split_by
  • document_split_threshold : decide how small a chunk can be before it is appended to a larger chunk instead of being kept separate, in units of document_split_by
  • drop_old : specify whether to drop old entries from the vector database
  • random_seed : random seed used when running the RAG benchmark
  • max_num_indexing_processes : specify how many independent document indexing processes to run in the RAG server
  • max_num_indexing_threads : specify how many threads per document indexing process to run in the RAG server
  • max_num_querying_processes : specify how many independent RAG querying processes to run in the RAG server
  • max_num_querying_threads : specify how many threads per RAG querying process to run in the RAG server
  • indexing_batch_size : specify the batch size for indexing workloads
  • querying_batch_size : specify the batch size for querying workloads
  • indexing_queue_size : specify the worker queue size for the indexing server
  • querying_queue_size : specify the worker queue size for the querying server
  • indexing_cpu_only : specify whether to run indexing embedding models on CPU instead of GPU
  • querying_cpu_only : specify whether to run querying embedding models on CPU instead of GPU
  • prompt_format : specify a prompt format which is used when augmenting prompts with retrieved document text
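
The CUDA configuration referenced earlier (/workspace/rag/configs/cuda.yaml) is expected to follow the same schema. As a rough sketch, assuming the 4-GPU NVIDIA node from the serve example above, it would likely differ from the ROCm file mainly in the backend and GPU count:

rag_embedding_model: sentence-transformers/all-mpnet-base-v2
gpu_backend: cuda
num_gpus: 4
# ...remaining fields as in rocm.yaml above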

Step 2: Check the RAG endpoint

The following curl commands run a health check and an example query against the inference endpoint (port 8011) and the RAG endpoint (port 8012). Run them from anywhere on your host machine.

curl -s http://127.0.0.1:8011/status | jq .

# ----- EXAMPLE OUTPUT -----
{
  "status": "running",
  "server_name": "/models/models/Llama-3.1-8B-Instruct"
}
# ----- EXAMPLE OUTPUT -----
curl -s -X POST http://127.0.0.1:8011/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Llama-3.1-8B-Instruct",
"messages":[{"role":"user","content":"Hello, who are you?"}]
}' | jq '.choices[0].message.content'

# ----- EXAMPLE OUTPUT -----
"<|start_header_id|>assistant<|end_header_id|>\n\nI'm an artificial intelligence model known as a large language model. I'm a computer program designed to understand and generate human-like text. I don't have a personal identity or physical presence; I exist solely as a digital entity, and my purpose is to assist and communicate with users like you through text-based conversations.\n\nI can answer questions, provide information, offer suggestions, and engage in discussions on a wide range of topics. My knowledge is based on the data I was trained on, which includes a massive corpus of text from the internet, books, and other sources. I'm constantly learning and improving my abilities based on user interactions and updates to my training data.\n\nFeel free to ask me anything, and I'll do my best to help!\n"
# ----- EXAMPLE OUTPUT -----
curl -s http://127.0.0.1:8012/rag/health

# ----- EXAMPLE OUTPUT -----
{"status":"ok"}
# ----- EXAMPLE OUTPUT -----
curl -s -X POST http://127.0.0.1:8012/rag/query \
-H "Content-Type: application/json" \
-d '{"text":"What is hypertension?"}' | jq .

# ----- EXAMPLE OUTPUT -----
{
  "context": {
    "start_time": 1748485090.4882593,
    "end_time": 1748485111.5286136,
    "response": {
      "generator": {
        "replies": [
          "<|start_header_id|>assistant<|end_header_id|>\n\nHypertension, also known as high blood pressure, is a medical condition in which the blood pressure in the arteries is persistently elevated. The blood pressure in the arteries is the force exerted by the blood flowing through them, and normal blood pressure is typically measured as between 90/60 mmHg and 120/80 mmHg. In hypertension, the top number (systolic blood pressure) is higher and/or the bottom number (diastolic blood pressure) is higher than this range, causing strain on the blood vessels, heart, and other organs.\n\nHypertension is often a silent condition, as it frequently does not produce symptoms in its early stages. However, if left unchecked, it can lead to serious health complications, such as heart attack, stroke, kidney disease, and vision loss. The exact cause of hypertension is often unknown, but risk factors include genetics, diet, lifestyle, and underlying medical conditions. Treatment typically involves lifestyle modifications such as diet, exercise, and potential medication to manage and control blood pressure levels.\n"
        ],
        "meta": [
          {
            "model": "/models/models/Llama-3.1-8B-Instruct",
            "index": 0,
            "finish_reason": null,
            "usage": {
              "completion_tokens": 0,
              "prompt_tokens": 0,
              "total_tokens": 0,
              "completion_tokens_details": null,
              "prompt_tokens_details": null
            }
          }
        ]
      }
    }
  }
}
# ----- EXAMPLE OUTPUT -----
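
When scripting against the RAG endpoint, you usually only need the generated text rather than the full response envelope. Based on the response structure shown above, the reply can be pulled out with a jq filter (a convenience sketch, not a separate API):

# Extract only the generated answer from the RAG response
curl -s -X POST http://127.0.0.1:8012/rag/query \
-H "Content-Type: application/json" \
-d '{"text":"What is hypertension?"}' | jq -r '.context.response.generator.replies[0]'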