Large Language Model
Large language models (LLMs) run on top of the HolmesAI infrastructure.
We provide a list of popular LLM models. You can create an automatic load-balancing service to run them and serve your apps or customers through our OpenAI-compatible APIs.
Before you start
You need three parameters before you begin: SERVICE_ID, API_KEY and MODEL. You can find them on our dashboard.
API reference
/v1/chat/completions
Request Parameters
Parameter name | Type | Description | Required
---|---|---|---
model | String | Model type. | Yes
messages | Array | A list of messages comprising the conversation so far. | Yes
role | String | The role of the message's author: system, user, assistant or tool. | Yes
content | String | The contents of the message. | Yes
temperature | Float | What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic. We generally recommend altering this or top_p, but not both. | No
top_p | Float | An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass. So 0.1 means only the tokens comprising the top 10% probability mass are considered. | No
presence_penalty | Float | Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model's likelihood to talk about new topics. | No
stop | Array | Up to 4 sequences where the API will stop generating further tokens. | No
max_tokens | Int | The maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via API. | No
stream | Bool | If set, partial message deltas will be sent, like in ChatGPT. Tokens will be sent as data-only server-sent events as they become available, with the stream terminated by a data: [DONE] message. | No
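The optional parameters above are passed straight through to the chat completions call. The snippet below is a minimal, non-streaming sketch using the OpenAI-compatible Python client configured as in the Usage section further down; reading SERVICE_ID, API_KEY and MODEL from environment variables and the specific parameter values are only illustrative, not requirements of the API.

```python
import os
import openai

# Client setup as in the Usage section; SERVICE_ID and API_KEY come from your dashboard
# and are assumed here to be exported as environment variables.
client = openai.OpenAI(
    base_url=f"https://modelapi.holmesai.xyz./{os.environ['SERVICE_ID']}/api/v1",
    api_key=os.environ["API_KEY"],
)

# A non-streaming request that sets several of the optional parameters.
completion = client.chat.completions.create(
    model=os.environ["MODEL"],
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a one-line greeting."},
    ],
    temperature=0.2,       # lower values give more deterministic output
    top_p=0.9,             # nucleus sampling; tune this or temperature, not both
    presence_penalty=0.5,  # nudge the model toward new topics
    stop=["\n\n"],         # stop generating at the first blank line
    max_tokens=128,        # cap the length of the generated reply
)
print(completion.choices[0].message.content)
```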
Response Elements
Parameter name | Type | Description
---|---|---
id | String | Session ID.
created | Int64 | The Unix timestamp (in seconds) of when the chat completion was created.
choices | [Object] | A list of chat completion choices. Can be more than one if n is greater than 1.
index | Int | The index of the choice in the list of choices.
message | Object | A chat completion message generated by the model.
role | String | The role of the author of this message.
content | String | The contents of the message.
finish_reason | String | The reason the model stopped generating tokens. This will be stop if the model hit a natural stop point or a provided stop sequence, length if the maximum number of tokens specified in the request was reached, content_filter if content was omitted due to a flag from our content filters, tool_calls if the model called a tool, or function_call (deprecated) if the model called a function.
usage | Object | Usage statistics for the completion request.
prompt_tokens | Int64 | Number of tokens in the prompt.
completion_tokens | Int64 | Number of tokens in the generated completion.
total_tokens | Int64 | Total number of tokens used in the request (prompt + completion).
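For a non-streaming request, these elements map one-to-one onto the attributes of the object returned by the Python client. A minimal sketch, again assuming the client setup from the Usage section with the dashboard values exposed as environment variables:

```python
import os
import openai

# Same client setup as in the Usage section; values come from your dashboard.
client = openai.OpenAI(
    base_url=f"https://modelapi.holmesai.xyz./{os.environ['SERVICE_ID']}/api/v1",
    api_key=os.environ["API_KEY"],
)

completion = client.chat.completions.create(
    model=os.environ["MODEL"],
    messages=[{"role": "user", "content": "say hello"}],
    max_tokens=128,
)

# The attributes below correspond to the response elements listed above.
print(completion.id)             # session id
print(completion.created)        # Unix timestamp (seconds) of creation

choice = completion.choices[0]
print(choice.index)              # index of this choice in the list
print(choice.message.role)       # e.g. "assistant"
print(choice.message.content)    # generated text
print(choice.finish_reason)      # "stop", "length", "content_filter", ...

print(completion.usage.prompt_tokens)
print(completion.usage.completion_tokens)
print(completion.usage.total_tokens)
```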
Usage
python
```bash
pip install -U openai
```
```python
import os
import openai

# SERVICE_ID, API_KEY and MODEL are the values from your dashboard,
# read here from environment variables.
client = openai.OpenAI(
    base_url=f"https://modelapi.holmesai.xyz./{os.environ['SERVICE_ID']}/api/v1",
    api_key=os.environ["API_KEY"],
)

completion = client.chat.completions.create(
    model=os.environ["MODEL"],
    messages=[
        {"role": "user", "content": "say hello"},
    ],
    max_tokens=128,
    stream=True,
)

# Stream the reply: print each content delta as it arrives.
for chunk in completion:
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="")
```
curl
```bash
curl https://modelapi.holmesai.xyz./$SERVICE_ID/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
        "model": "$MODEL",
        "messages": [{"role": "user", "content": "say hello"}],
        "max_tokens": 128
      }'
```