Chat Completions

POST /v1/chat/completions

Generate a chat completion. This endpoint is fully OpenAI-compatible and supports both streaming and non-streaming responses.

Authentication

Optional by default. An API key is required when allow_network_access is enabled, in which case send it as a Bearer token in the Authorization header:

Authorization: Bearer <your-api-key>
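
Because the header is needed only in some deployments, a client may want to attach it conditionally. The following is a minimal sketch, not part of the server's API: build_headers is a hypothetical helper name, and the key value in the usage line is a placeholder.

```python
from typing import Dict, Optional


def build_headers(api_key: Optional[str] = None) -> Dict[str, str]:
    """Return request headers, adding a Bearer token only when a key is given."""
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Needed when the server has allow_network_access enabled.
        headers["Authorization"] = f"Bearer {api_key}"
    return headers


# Hypothetical usage: "my-secret-key" is a placeholder, not a real key.
print(build_headers("my-secret-key"))
```

Leaving the header out when no key is configured keeps the same client code usable against a local instance that does not require authentication.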

Request Body

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| model | string | Yes | - | Model ID to use for completion |
| messages | array | Yes | - | Array of message objects |
| temperature | number | No | 1.0 | Sampling temperature, between 0 and 2 |
| top_p | number | No | 1.0 | Nucleus sampling threshold |
| max_tokens | integer | No | Model default | Maximum number of tokens to generate |
| stream | boolean | No | false | Enable streaming via Server-Sent Events |

Message Object

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| role | string | Yes | One of system, user, or assistant |
| content | string | Yes | The message content |
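
The field rules above can be checked on the client before a request is sent. This is a sketch under stated assumptions: build_chat_request is a hypothetical helper, the checks mirror only what the tables document, and rejecting an empty messages list is an extra assumption rather than documented server behavior.

```python
from typing import Any, Dict, List, Optional

ALLOWED_ROLES = {"system", "user", "assistant"}


def build_chat_request(
    model: str,
    messages: List[Dict[str, str]],
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
    max_tokens: Optional[int] = None,
    stream: bool = False,
) -> Dict[str, Any]:
    """Validate the documented fields and return a JSON-serializable body."""
    if not model:
        raise ValueError("model is required")
    if not messages:
        # Assumption: an empty list is treated as invalid on the client side.
        raise ValueError("messages must contain at least one message")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"invalid role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("content must be a string")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2")

    payload: Dict[str, Any] = {"model": model, "messages": messages}
    # Optional fields are omitted when unset so the server applies its defaults.
    if temperature is not None:
        payload["temperature"] = temperature
    if top_p is not None:
        payload["top_p"] = top_p
    if max_tokens is not None:
        payload["max_tokens"] = max_tokens
    if stream:
        payload["stream"] = True
    return payload


body = build_chat_request(
    "llama-3.1-8b-q4",
    [{"role": "user", "content": "Hi"}],
    temperature=0.7,
)
print(body)
```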

Non-Streaming Response

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1713100800,
  "model": "llama-3.1-8b-q4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "completion_tokens": 9,
    "total_tokens": 21
  }
}
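
To show where the useful values sit in this structure, here is a small sketch that reads the reply text, finish reason, and token total from a response body. extract_reply is a hypothetical helper name, and SAMPLE_BODY is an abbreviated copy of the example above rather than captured server output.

```python
import json
from typing import Tuple

# Abbreviated copy of the example response shown above.
SAMPLE_BODY = """
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello! How can I help you today?"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 12, "completion_tokens": 9, "total_tokens": 21}
}
"""


def extract_reply(body: str) -> Tuple[str, str, int]:
    """Return (content, finish_reason, total_tokens) from a response body."""
    data = json.loads(body)
    choice = data["choices"][0]
    return (
        choice["message"]["content"],
        choice["finish_reason"],
        data["usage"]["total_tokens"],
    )


content, finish_reason, total_tokens = extract_reply(SAMPLE_BODY)
print(content, finish_reason, total_tokens)
```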

Streaming Response

When stream is true, the response is delivered as Server-Sent Events. Each event is a single data: line containing a JSON chunk, and the stream ends with a final data: [DONE] line.

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1713100800,"model":"llama-3.1-8b-q4","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1713100800,"model":"llama-3.1-8b-q4","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1713100800,"model":"llama-3.1-8b-q4","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1713100800,"model":"llama-3.1-8b-q4","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
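
For clients that do not use an SDK, the stream can be reassembled by hand. The sketch below assumes the response has already been split into text lines. iter_chunks and collect_text are hypothetical helper names, and SAMPLE_STREAM is a shortened version of the events above with only the choices field kept.

```python
import json
from typing import Iterable, Iterator

# Shortened version of the example stream: only the "choices" field is kept.
SAMPLE_STREAM = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield each decoded JSON chunk until the [DONE] sentinel is seen."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip the blank lines that separate events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas of the first choice into one string."""
    parts = []
    for chunk in iter_chunks(lines):
        delta = chunk["choices"][0]["delta"]
        # The final chunk has an empty delta, so content may be missing.
        parts.append(delta.get("content") or "")
    return "".join(parts)


print(collect_text(SAMPLE_STREAM))
```

Checking for the sentinel before calling json.loads matters because [DONE] is not valid JSON for a chunk object.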

Streaming Chunk Object

| Field | Type | Description |
|-------|------|-------------|
| id | string | Completion ID (same across all chunks) |
| object | string | Always chat.completion.chunk |
| created | integer | Unix timestamp |
| model | string | Model ID used |
| choices | array | Array with one choice object |
| choices[].index | integer | Choice index |
| choices[].delta | object | Incremental content (role and/or content) |
| choices[].finish_reason | string or null | stop on final chunk, null otherwise |

Examples

Non-streaming request

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-q4",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'

Streaming request

curl http://localhost:11435/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b-q4",
    "messages": [
      {"role": "user", "content": "Write a haiku about computers."}
    ],
    "stream": true
  }'

Using the OpenAI Python SDK

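from openai import OpenAI

# The SDK requires a non-empty api_key; this placeholder works when
# authentication is disabled. Use your real key if allow_network_access is enabled.
client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="llama-3.1-8b-q4",
    messages=[
        {"role": "user", "content": "Explain quantum computing in one sentence."}
    ],
)

# The reply text is in the first (and only) choice.
print(response.choices[0].message.content)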

Streaming with the OpenAI Python SDK

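from openai import OpenAI

# The SDK requires a non-empty api_key; this placeholder works when
# authentication is disabled. Use your real key if allow_network_access is enabled.
client = OpenAI(
    base_url="http://localhost:11435/v1",
    api_key="not-needed",
)

stream = client.chat.completions.create(
    model="llama-3.1-8b-q4",
    messages=[
        {"role": "user", "content": "Tell me a joke."}
    ],
    stream=True,
)

# Print each content delta as it arrives; the final chunk carries no content.
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()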

Error Responses

| Status | Error Type | Description |
|--------|------------|-------------|
| 400 | invalid_request | Missing or invalid parameters |
| 401 | authentication_error | Invalid or missing API key |
| 404 | not_found | Requested model not available |
| 503 | service_unavailable | No model loaded or node is busy |
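
A client can use the status code to decide whether a retry makes sense. The sketch below is one possible policy, not documented server behavior: it assumes 503 is worth retrying because the node may only be busy, and that the other statuses need a corrected request. The helper names and the backoff values are hypothetical.

```python
from typing import List

# Status codes and error types as listed in the table above.
ERROR_TYPES = {
    400: "invalid_request",
    401: "authentication_error",
    404: "not_found",
    503: "service_unavailable",
}

# Assumption: only 503 is transient; the others need a changed request.
RETRYABLE_STATUSES = {503}


def describe_error(status: int) -> str:
    """Map a status code to its documented error type, or 'unknown'."""
    return ERROR_TYPES.get(status, "unknown")


def should_retry(status: int) -> bool:
    """Return True when retrying the same request might succeed."""
    return status in RETRYABLE_STATUSES


def backoff_delays(attempts: int, base: float = 0.5) -> List[float]:
    """Return exponential backoff delays in seconds, one per retry attempt."""
    return [base * (2 ** i) for i in range(attempts)]


for status in (400, 503):
    print(status, describe_error(status), should_retry(status))
print(backoff_delays(3))
```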