LLM Ops
LLM Ops is a service for deploying LLMs on GPUs. It provides an API that supports both chat and completion requests, in streaming and non-streaming modes. The service is built on the FastChat platform and uses the vLLM serving engine to run models on GPUs.
Model Deployment
Requirements
- Docker
Setup
The service is dockerized, enabling straightforward deployment with a single Docker command. To start the service, navigate to the llm-ops directory and run:
docker-compose up -d
Deploying a Model
Currently, deploying a new model requires adding it explicitly to the docker-compose file. Below is an example that deploys the LLaMA-2 7B chat model.
llm_chat:
  build:
    context: .
  container_name: llm_chat
  volumes:
    - /home/hf_models:/root/.cache/huggingface # replace "/home/hf_models" with your own Hugging Face models directory
  deploy:
    resources:
      reservations:
        devices: # adjust this based on the specification of your machine
          - driver: nvidia
            count: 1
            capabilities: [gpu]
  entrypoint:
    - /bin/bash
    - ./start_chat.sh
  command:
    - --model-path
    - ../root/.cache/huggingface/Llama-2-7b-chat # falcon-7b-instruct # Llama-2-7b-chat # vicuna-7b-v1.3
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.llm-chat.rule=PathPrefix(`/api/Llama-2-7b-chat`)" # API path of your model; adjust it based on the model you deploy
    - "traefik.http.routers.llm-chat.entrypoints=websecure"
    - "traefik.http.routers.llm-chat.tls=true"
    - "traefik.http.routers.llm-chat.tls.certresolver=le"
    - "traefik.http.routers.llm-chat.middlewares=llm-chat-stripprefix,llm-chat-addprefix"
    - "traefik.http.middlewares.llm-chat-stripprefix.stripPrefixRegex.regex=/api/[a-zA-Z0-9_-]+"
    - "traefik.http.middlewares.llm-chat-addprefix.addPrefix.prefix=/api"
Supported Models
Currently, the following models are supported by default:
- LLaMA-2
- Vicuna v1.1
- Dolly V2
- Falcon 180B
- Falcon
- Mistral-instruct-v0.1
If you want to add support for a new model, you have to consider the following:
- The model has to be supported by vLLM. See: Supported Models — vLLM
- If you want to support a chat model, you also have to add a new conv_template in llm-ops/llm_ops/prompts/conversation.py. For example, here is how to add the conv_template for the LLaMA-2 chat model:
register_conv_template(
    Conversation(
        name="llama-2",
        system_template="[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n",
        roles=("[INST]", "[/INST]"),
        sep_style=SeparatorStyle.LLAMA2,
        sep=" ",
        sep2=" </s><s>",
    )
)
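For context, here is a minimal sketch of how such a registered template is typically turned into a prompt string. It assumes the local conversation.py keeps FastChat's usual helpers (get_conv_template, set_system_message, append_message, get_prompt) and that the module is importable as llm_ops.prompts.conversation; adjust the import to your setup.
from llm_ops.prompts.conversation import get_conv_template

# Fetch a copy of the registered "llama-2" template and build a prompt from it.
conv = get_conv_template("llama-2")
conv.set_system_message("The following is a friendly conversation between a human and an AI.")
conv.append_message(conv.roles[0], "Tell me a short funny joke.")  # user turn
conv.append_message(conv.roles[1], None)  # empty assistant slot to be filled by generation
print(conv.get_prompt())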
API
After starting the service with your model deployed, you can make non-streaming or streaming requests. The following are examples of requests to the deployed model Llama-2-7b-chat.
Non-Streaming Request
curl -k -X POST https://localhost:8443/api/Llama-2-7b-chat/worker_generate \
  -H "Content-Type: application/json" \
  -d '{
    "model_identifier": "Llama-2-7b-chat",
    "messages": [
      {
        "role": "user",
        "text": "Hello!"
      },
      {
        "role": "ai",
        "text": "Hey! How can I help you today?"
      },
      {
        "role": "user",
        "text": "Tell me a short funny joke."
      }
    ],
    "system_message": "The following is a friendly conversation between a human and an AI.",
    "temperature": 0.7,
    "top_p": 0.9,
    "echo": false,
    "generation_mode": "chat"
  }'
generation_mode can be either chat or completion, depending on the type of request you want to make. If you want to make a completion request, you have to set generation_mode to completion and provide a string prompt instead of messages.
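For reference, the same chat request can be sent from Python with the requests library. This is only a sketch: it assumes the service is reachable at https://localhost:8443 with a self-signed certificate (hence verify=False, mirroring curl's -k) and that the worker responds with a JSON body, so adapt it to the actual response schema of your deployment.
import requests

payload = {
    "model_identifier": "Llama-2-7b-chat",
    "messages": [
        {"role": "user", "text": "Hello!"},
        {"role": "ai", "text": "Hey! How can I help you today?"},
        {"role": "user", "text": "Tell me a short funny joke."},
    ],
    "system_message": "The following is a friendly conversation between a human and an AI.",
    "temperature": 0.7,
    "top_p": 0.9,
    "echo": False,
    "generation_mode": "chat",
}

# verify=False corresponds to curl's -k flag for the self-signed certificate.
response = requests.post(
    "https://localhost:8443/api/Llama-2-7b-chat/worker_generate",
    json=payload,
    verify=False,
)
response.raise_for_status()
print(response.json())  # the exact response schema depends on the deployment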
Streaming Request
The streaming request is very similar to the non-streaming request, but you have to use the endpoint /api/Llama-2-7b-chat/worker_generate_stream instead of /api/Llama-2-7b-chat/worker_generate. The example below replaces the messages field with a prompt field to also show how to make a completion request.
curl -k -X POST https://localhost:8443/api/Llama-2-7b-chat/worker_generate_stream \
  -H "Content-Type: application/json" \
  -d '{
    "model_identifier": "Llama-2-7b-chat",
    "prompt": "Hello! Can you tell me a joke?",
    "system_message": "The following is a friendly conversation between a human and an AI.",
    "temperature": 0.7,
    "top_p": 0.9,
    "echo": false,
    "generation_mode": "completion"
  }'
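The stream can be consumed from Python as well. The sketch below simply prints the raw chunks as they arrive; it assumes the worker emits delimited JSON chunks in the style of FastChat workers, so parse each chunk according to the format your deployment actually uses.
import requests

payload = {
    "model_identifier": "Llama-2-7b-chat",
    "prompt": "Hello! Can you tell me a joke?",
    "system_message": "The following is a friendly conversation between a human and an AI.",
    "temperature": 0.7,
    "top_p": 0.9,
    "echo": False,
    "generation_mode": "completion",
}

# stream=True keeps the connection open so chunks can be read as they arrive;
# verify=False corresponds to curl's -k flag for the self-signed certificate.
with requests.post(
    "https://localhost:8443/api/Llama-2-7b-chat/worker_generate_stream",
    json=payload,
    verify=False,
    stream=True,
) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=None):
        if chunk:
            # Printed as-is here; decode/parse according to the chunk format of your worker.
            print(chunk.decode("utf-8", errors="replace"))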
Note that in both non-streaming and streaming requests, you have to provide model_identifier, prompt or messages, and generation_mode. The other fields are optional. The echo field determines whether the service returns the initial prompt/messages in the response.