Large Language Model
Large language models (LLMs) run on top of the HolmesAI infrastructure.
We provide a list of popular LLM models. You can create an automatic load-balancing service to run them and serve your apps or customers through our OpenAI-compatible APIs.
Before you start
You need three parameters before you begin: SERVICE_ID, API_KEY and MODEL. You can find them on our dashboard.
API reference
/v1/chat/completions
Request Parameters
Parameter name | Type | Description | Required
---|---|---|---
model | String | Model type. | Yes
messages | Array | A list of messages comprising the conversation so far. | Yes
role | String | The role of the message's author: system, user, assistant or tool. | Yes
content | String | The contents of the message. | Yes
temperature | Float | What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p, but not both. | No
top_p | Float | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. | No
presence_penalty | Float | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. | No
stop | Array | Up to 4 sequences where the API will stop generating further tokens. | No
max_tokens | Int | The maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via API. | No
stream | Bool | If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. | No
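The optional parameters above are passed straight through to the chat completions call. The snippet below is a minimal, non-streaming sketch using the OpenAI-compatible Python client configured as in the Usage section further down; reading SERVICE_ID, API_KEY and MODEL from environment variables and the specific parameter values are only illustrative, not requirements of the API.

```python
import os
import openai

# Client setup as in the Usage section; SERVICE_ID and API_KEY come from your dashboard
# and are assumed here to be exported as environment variables.
client = openai.OpenAI(
    base_url=f"https://modelapi.holmesai.xyz./{os.environ['SERVICE_ID']}/api/v1",
    api_key=os.environ["API_KEY"],
)

# A non-streaming request that sets several of the optional parameters.
completion = client.chat.completions.create(
    model=os.environ["MODEL"],
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a one-line greeting."},
    ],
    temperature=0.2,       # lower values give more deterministic output
    top_p=0.9,             # nucleus sampling; tune this or temperature, not both
    presence_penalty=0.5,  # nudge the model toward new topics
    stop=["\n\n"],         # stop generating at the first blank line
    max_tokens=128,        # cap the length of the generated reply
)
print(completion.choices[0].message.content)
```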
Response Elements
Parameter name | Type | Description
---|---|---
id | String | Session ID.
created | Int64 | The Unix timestamp (in seconds) of when the chat completion was created.
choices | [Object] | A list of chat completion choices. Can be more than one if n is greater than 1.
index | Int | The index of the choice in the list of choices.
message | Object | A chat completion message generated by the model.
role | String | The role of the author of this message.
content | String | The contents of the message.
finish_reason | String | The reason the model stopped generating tokens. This will be stop if the model hit a natural stop point or a provided stop sequence, length if the maximum number of tokens specified in the request was reached, content_filter if content was omitted due to a flag from our content filters, tool_calls if the model called a tool, or function_call (deprecated) if the model called a function.
usage | Object | Usage statistics for the completion request.
prompt_tokens | Int64 | Number of tokens in the prompt.
completion_tokens | Int64 | Number of tokens in the generated completion.
total_tokens | Int64 | Total number of tokens used in the request (prompt + completion).
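For a non-streaming request, these elements map one-to-one onto the attributes of the object returned by the Python client. A minimal sketch, again assuming the client setup from the Usage section with the dashboard values exposed as environment variables:

```python
import os
import openai

# Same client setup as in the Usage section; values come from your dashboard.
client = openai.OpenAI(
    base_url=f"https://modelapi.holmesai.xyz./{os.environ['SERVICE_ID']}/api/v1",
    api_key=os.environ["API_KEY"],
)

completion = client.chat.completions.create(
    model=os.environ["MODEL"],
    messages=[{"role": "user", "content": "say hello"}],
    max_tokens=128,
)

# The attributes below correspond to the response elements listed above.
print(completion.id)             # session id
print(completion.created)        # Unix timestamp (seconds) of creation

choice = completion.choices[0]
print(choice.index)              # index of this choice in the list
print(choice.message.role)       # e.g. "assistant"
print(choice.message.content)    # generated text
print(choice.finish_reason)      # "stop", "length", "content_filter", ...

print(completion.usage.prompt_tokens)
print(completion.usage.completion_tokens)
print(completion.usage.total_tokens)
```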
Usage
python
```bash
pip install -U openai
```
```python
import os
import openai

# SERVICE_ID, API_KEY and MODEL are the values from your dashboard,
# read here from environment variables.
client = openai.OpenAI(
    base_url=f"https://modelapi.holmesai.xyz./{os.environ['SERVICE_ID']}/api/v1",
    api_key=os.environ["API_KEY"],
)

completion = client.chat.completions.create(
    model=os.environ["MODEL"],
    messages=[
        {"role": "user", "content": "say hello"},
    ],
    max_tokens=128,
    stream=True,
)

# Stream the reply: print each content delta as it arrives.
for chunk in completion:
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="")
```
curl
```bash
curl https://modelapi.holmesai.xyz./$SERVICE_ID/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "model": "$MODEL",
        "messages": [{"role": "user", "content": "say hello"}],
        "max_tokens": 128
      }'
```