# Baseten Model APIs — Orthogonal API > Pay-per-use API on Orthogonal. Each call is billed to your Orthogonal balance. > Base API: `https://api.orth.sh/v1/run` · [llms.txt](https://orthogonal.com/llms.txt) · [browse all APIs](https://orthogonal.com/discover) Baseten is a high-performance inference platform for running open source LLMs. This API provides an OpenAI-compatible chat completions endpoint, so it works as a drop-in replacement with any OpenAI SDK or client. Just swap the base URL and API key. Available models include DeepSeek V3 and V3.1, GLM 4.6 and 4.7 from Zhipu, Moonshot's Kimi K2 family (including K2 Thinking with 262k context), and OpenAI's GPT OSS 120B. Several models support controllable reasoning depth via a reasoning_effort parameter. Supports streaming, tool calling, structured outputs, and extended context windows up to 262k tokens. **Verified:** yes ## Access **Run API:** `POST https://api.orth.sh/v1/run` **Auth:** `Authorization: Bearer $ORTHOGONAL_API_KEY` Get an API key at https://orthogonal.com/dashboard/settings/api-keys Every call goes through the unified Run API: send the API `slug`, the endpoint `path`, and the `query`/`body` parameters. The response is `{ "success": true, "price": "", "data": { ... } }`. ## Endpoints ### Chat Completions Send a conversation to a model and get a completion back. Works exactly like the OpenAI chat completions endpoint. Pass messages and a model slug, get a response with the assistant's reply. Supports streaming for real-time token delivery, tool calling for function execution, structured outputs via response_format, and controllable reasoning depth on supported models. `POST /v1/chat/completions` **Estimated cost:** Dynamic — use `"dryRun": true` in the Run API request to check the exact cost before calling. | Field | Type | Required | Description | |-------|------|----------|-------------| | `messages` | array | Yes | Array of message objects, each with a 'role' (system, user, assistant, tool) and 'content' (string or array of content parts). This is the conversation history sent to the model. | | `model` | string | Yes | Model slug to run inference against. Available models: deepseek-ai/DeepSeek-V3-0324 (164k context, reasoning), deepseek-ai/DeepSeek-V3.1 (164k context, reasoning), zai-org/GLM-4.6 (200k context, reasoning), zai-org/GLM-4.7 (200k context, reasoning), moonshotai/Kimi-K2-Instruct-0905 (128k context), moonshotai/Kimi-K2-Thinking (262k context, always-on reasoning), moonshotai/Kimi-K2.5 (262k context), openai/gpt-oss-120b (128k context). Reasoning models support the reasoning_effort parameter for controlling thinking depth. | | `frequency_penalty` | number | No | Penalize tokens based on how often they've appeared so far. Range -2.0 to 2.0. Positive values reduce repetition. Default: 0. | | `logit_bias` | object | No | Map of token IDs to bias values (-100 to 100). Increase or decrease the likelihood of specific tokens appearing in the output. | | `logprobs` | boolean | No | If true, returns the log probabilities of each output token in the response. | | `top_logprobs` | number | No | How many of the most likely tokens (0-20) to return log probabilities for at each position. Requires logprobs to be true. | | `max_tokens` | number | No | Maximum number of tokens to generate in the response. Default is 4096. | | `n` | number | No | Number of completions to generate. Currently only supports 1. | | `presence_penalty` | number | No | Penalize tokens based on whether they've appeared at all. Range -2.0 to 2.0. Positive values encourage the model to explore new topics. Default: 0. | | `response_format` | object | No | Constrain the output format. Use {"type": "json_object"} for JSON mode, or {"type": "json_schema", "json_schema": {"name": "...", "schema": {...}}} for structured outputs with a specific schema. | | `seed` | number | No | Integer for deterministic sampling. Same seed with same parameters should return the same result. Not guaranteed across model versions. | | `stop` | string | No | Up to 4 sequences where the model will stop generating. Can be a string or array of strings. | | `stream` | boolean | No | If true, returns server-sent events (SSE) with partial message deltas as tokens are generated, instead of waiting for the full response. | | `stream_options` | object | No | Options for streaming. Use {"include_usage": true} to get a final chunk with token usage statistics. | | `temperature` | number | No | Sampling temperature between 0 and 4. Lower values (e.g. 0.2) produce more focused, deterministic output. Higher values (e.g. 1.5) increase creativity. Default is 1. | | `top_p` | number | No | Nucleus sampling threshold between 0 and 1. Only tokens within this cumulative probability mass are considered. 0.1 means only the top 10%. Use as an alternative to temperature. | | `tools` | array | No | Array of tool/function definitions the model can call. Each tool has {"type": "function", "function": {"name": "...", "description": "...", "parameters": {...}}}. The model may respond with tool_calls instead of content. | | `tool_choice` | string | No | Controls tool calling behavior. 'auto' lets the model decide, 'none' disables tools, 'required' forces a tool call, or pass {"type": "function", "function": {"name": "..."}} to force a specific tool. | | `parallel_tool_calls` | boolean | No | Whether the model can make multiple tool calls in parallel in a single response. Default: true. | | `user` | string | No | A unique string identifying the end user. Useful for abuse monitoring and rate limiting. | | `top_k` | number | No | Top-K sampling. Only the K most likely next tokens are considered. Lower values make output more focused. | | `top_p_min` | number | No | Minimum dynamic nucleus sampling threshold. Sets a floor for top_p when using adaptive sampling. | | `min_p` | number | No | Minimum probability threshold. Tokens below this probability relative to the most likely token are filtered out. | | `repetition_penalty` | number | No | Multiplicative penalty for repeated tokens. Values > 1.0 discourage repetition, < 1.0 encourage it. | | `length_penalty` | number | No | Penalty applied during beam search. Values > 1.0 favor longer sequences, < 1.0 favor shorter ones. | | `early_stopping` | boolean | No | In beam search, stop as soon as the required number of complete candidates are found. | | `bad` | string | No | Words or phrases the model should avoid generating. Passed as a string. | | `bad_token_ids` | array | No | Array of token IDs that should never appear in the output. | | `stop_token_ids` | array | No | Array of token IDs that will cause generation to stop when produced. | | `include_stop_str_in_output` | boolean | No | If true, includes the stop string in the generated output rather than trimming it. | | `ignore_eos` | boolean | No | If true, the model continues generating past the end-of-sequence token. | | `min_tokens` | number | No | Minimum number of tokens to generate before any stop condition can trigger. | | `skip_special_tokens` | boolean | No | If true, special tokens are removed from the output text. Default: true. | | `spaces_between_special_tokens` | boolean | No | If true, adds spaces between special tokens in the detokenized output. | | `truncate_prompt_tokens` | number | No | Truncate the prompt to this many tokens if it exceeds the limit, keeping the most recent tokens. | | `echo` | boolean | No | If true, prepends the last input message to the generated output. | | `add_generation_prompt` | boolean | No | If true, applies the model's generation prompt template. Usually needed for chat models. | | `add_special_tokens` | boolean | No | If true, adds special tokens (like BOS) to the input. Default: true. | | `documents` | array | No | Array of document objects for retrieval-augmented generation (RAG). Each document has content the model can reference when responding. | | `chat_template` | string | No | Custom Jinja2 template for formatting the conversation. Overrides the model's default chat template. | | `chat_template_args` | object | No | Additional arguments passed to the chat template as template variables. | | `best_of` | number | No | Number of candidate completions to generate server-side, returning the best. Currently only supports 1. | | `disaggregated_params` | object | No | Advanced parameters for distributed inference. Only relevant for disaggregated serving configurations. | | `reasoning_effort` | string | No | Controls thinking depth for reasoning models. Options: 'low', 'medium', 'high'. Default: 'medium'. Higher effort uses more tokens but produces more thorough reasoning. Supported on DeepSeek V3/V3.1, GLM 4.6/4.7, and Kimi K2 Thinking. | ```bash curl -X POST 'https://api.orth.sh/v1/run' \ -H 'Authorization: Bearer $ORTHOGONAL_API_KEY' \ -H 'Content-Type: application/json' \ -d '{"api":"baseten","path":"/v1/chat/completions","body":{"messages":"","model":"","frequency_penalty":"","logit_bias":"