Pricing overview
Together AI provides a usage-based pricing structure for its large language model (LLM) inference and fine-tuning services. The primary cost drivers are the specific model chosen, the volume of tokens processed for inference, and the compute time (measured hourly) for fine-tuning tasks. This pay-as-you-go approach is designed to offer flexibility, allowing users to scale their usage up or down without fixed subscriptions or long-term commitments (Together AI pricing page). The platform emphasizes making open-source LLMs more accessible and cost-effective for developers and researchers.
Key components of Together AI's pricing include:
- Inference API: Billed per million tokens, with separate rates for input (prompt) and output (completion) tokens. Prices vary significantly across different models, reflecting their computational complexity and performance characteristics. Together AI hosts a range of open-source models, including those from Meta, Mistral AI, and others, each with its own pricing structure (Together AI model pricing).
- Fine-tuning API: Charged on an hourly basis for the GPU compute time utilized during the fine-tuning process. This includes the time spent training custom models on user-provided datasets. The cost depends on the GPU type allocated and the duration of the training job.
- Serverless GPUs: Provides on-demand GPU access for custom workloads, billed by the second or hour depending on the specific GPU instance and its configuration. This service is intended for users requiring more granular control over their compute resources.
Together AI also offers enterprise-level solutions with custom pricing for high-volume users or those with specific infrastructure and support requirements. These plans typically include dedicated resources, enhanced service level agreements (SLAs), and specialized technical support (Together AI enterprise options).
Plans and tiers
Together AI primarily operates on a single, flexible pay-as-you-go model rather than distinct subscription tiers for its core API services. Apart from custom enterprise agreements, all users access the same API capabilities and are billed directly based on their consumption. Under the standard model, differences in cost come from usage volume and the specific models or compute resources chosen.
The pricing structure is transparent, with detailed breakdowns available for each supported model and GPU configuration. There are no monthly fees or minimum spend requirements for the standard pay-as-you-go service. This model is particularly beneficial for projects with fluctuating demands or those in early development stages, as it avoids upfront costs and commitment (Together AI pricing details).
Pay-as-you-go model pricing (examples)
Below is an illustrative table summarizing example inference costs for selected models, based on information from Together AI's official pricing page. Prices are subject to change and should be verified on the official website.
| Model Name | Input Tokens (per 1M) | Output Tokens (per 1M) | Best For |
|---|---|---|---|
| Llama-2-7B-Chat | $0.25 | $0.25 | General-purpose chat, rapid prototyping |
| Mistral-7B-Instruct-v0.2 | $0.20 | $0.20 | Instruction following, coding assistance |
| Mixtral-8x7B-Instruct-v0.1 | $0.40 | $0.40 | Complex reasoning, multi-task applications |
| Qwen-1.5-14B-Chat | $0.45 | $0.45 | Multilingual chat, extended context |
| CodeLlama-34B-Instruct | $0.80 | $0.80 | Advanced code generation, refactoring |
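As a rough illustration of how per-million-token billing translates into the cost of a single request, the sketch below applies the example rates from the table to a token count. The `RATES_PER_MILLION` dictionary, the `request_cost` function, and the `prompt_tokens`/`completion_tokens` field names are assumptions made for this example, not part of any official SDK, and the rates are the illustrative figures above rather than confirmed current prices.

```python
from decimal import Decimal

# Example rates in USD per 1M tokens, copied from the illustrative table above.
# Each entry is (input rate, output rate); verify against the official pricing page.
RATES_PER_MILLION = {
    "Llama-2-7B-Chat": (Decimal("0.25"), Decimal("0.25")),
    "Mistral-7B-Instruct-v0.2": (Decimal("0.20"), Decimal("0.20")),
    "Mixtral-8x7B-Instruct-v0.1": (Decimal("0.40"), Decimal("0.40")),
    "Qwen-1.5-14B-Chat": (Decimal("0.45"), Decimal("0.45")),
    "CodeLlama-34B-Instruct": (Decimal("0.80"), Decimal("0.80")),
}

MILLION = Decimal(1_000_000)


def request_cost(model: str, prompt_tokens: int, completion_tokens: int) -> Decimal:
    """Return the estimated USD cost of one request at the example rates."""
    input_rate, output_rate = RATES_PER_MILLION[model]
    return (input_rate * prompt_tokens + output_rate * completion_tokens) / MILLION


# Hypothetical usage record for a single chat request.
usage = {"prompt_tokens": 1_200, "completion_tokens": 300}
cost = request_cost("Mixtral-8x7B-Instruct-v0.1", **usage)
print(f"Estimated cost: ${cost:.6f}")
```

`Decimal` is used here so that very small per-request amounts are not distorted by binary floating-point rounding when many requests are summed.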
Fine-tuning pricing (examples)
Fine-tuning costs are calculated based on the GPU type and the duration of the training run. The table below gives example hourly rates; current rates are listed on the official pricing page.
| GPU Type | Hourly Rate | Considerations |
|---|---|---|
| A100 (40GB) | $1.50 - $2.50+ | High-performance large model training |
| A10 (24GB) | $0.75 - $1.25 | Balanced performance for medium-sized models |
Actual fine-tuning costs will depend on dataset size, model complexity, and training parameters. Users are encouraged to estimate their specific needs and consult the Together AI documentation for precise, up-to-date rates.
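Because the example hourly rates are given as ranges, a fine-tuning estimate is better expressed as a range too. The following is a minimal sketch under that assumption; `HOURLY_RATE_RANGES` and `fine_tuning_cost_range` are hypothetical names, and the figures are the illustrative rates from the table, not confirmed prices.

```python
# Example hourly rate ranges in USD, copied from the illustrative table above.
# The A100 upper bound is listed as "$2.50+", so treat it as a soft ceiling.
HOURLY_RATE_RANGES = {
    "A100 (40GB)": (1.50, 2.50),
    "A10 (24GB)": (0.75, 1.25),
}


def fine_tuning_cost_range(gpu_type: str, hours: float) -> tuple[float, float]:
    """Return (low, high) estimated USD cost for a training run of `hours`."""
    low_rate, high_rate = HOURLY_RATE_RANGES[gpu_type]
    return (round(low_rate * hours, 2), round(high_rate * hours, 2))


low, high = fine_tuning_cost_range("A10 (24GB)", hours=5)
print(f"A10, 5 hours: ${low:.2f} to ${high:.2f}")
```

Budgeting against the upper end of the range leaves headroom for longer-than-expected training runs.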
Free tier and limits
Together AI provides a free tier designed to allow new users to explore the platform's capabilities without an initial financial commitment. This free tier includes up to $25 in free credits upon account creation (Together AI free credits). These credits can be applied towards any of Together AI's services, including LLM inference and fine-tuning.
The purpose of the free credits is to facilitate:
- Experimentation: Developers can test various open-source models to determine which best fits their application or research needs.
- Proof-of-Concept Development: Small projects or initial prototypes can be built and evaluated using the free resources.
- Learning and Education: Students and researchers can gain hands-on experience with LLM deployment and fine-tuning.
Once the free credits are exhausted, further usage is billed at the standard pay-as-you-go rates. To avoid service interruption, users typically need to add a payment method to their account before reaching the credit limit. Any limits on how long credits remain valid, or on maximum concurrent requests under the free tier, are set out in the user agreement or on the pricing page.
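To get a feel for how far the credits stretch, the sketch below divides the credit balance by a steady monthly spend. It assumes the full $25 is granted and ignores any expiry period on the credits, which the text above does not specify; `months_of_runway` is a hypothetical helper name.

```python
FREE_CREDITS_USD = 25.00  # credit amount stated above (an upper bound)


def months_of_runway(monthly_cost_usd: float, credits_usd: float = FREE_CREDITS_USD) -> float:
    """Return how many months the credits last at a steady monthly spend."""
    if monthly_cost_usd <= 0:
        raise ValueError("monthly_cost_usd must be positive")
    return credits_usd / monthly_cost_usd


# Hypothetical prototype spending $5.00 per month on inference.
print(f"{months_of_runway(5.00):.1f} months")  # prints "5.0 months"
```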
Real-world cost examples
Understanding Together AI's pricing in practice often benefits from concrete examples. The actual cost will depend heavily on the chosen model, the volume of tokens, and the duration of fine-tuning tasks.
Inference API scenarios
- Basic Chatbot (Low Volume):
  - Scenario: A developer builds a simple chatbot using Llama-2-7B-Chat for internal team communication. Approximately 1 million input tokens and 1 million output tokens are generated per month.
  - Calculation (example rates): ($0.25/M input tokens * 1M) + ($0.25/M output tokens * 1M) = $0.25 + $0.25 = $0.50 per month.
  - Outcome: Very low cost, easily covered by the free credits for many months.
- Content Generation (Medium Volume):
  - Scenario: A content agency uses Mistral-7B-Instruct-v0.2 to generate article drafts and summaries, processing 50 million input tokens and 40 million output tokens per month.
  - Calculation (example rates): ($0.20/M input tokens * 50M) + ($0.20/M output tokens * 40M) = $10.00 + $8.00 = $18.00 per month.
  - Outcome: A moderate monthly cost, demonstrating scalability for regular usage.
- Complex Application (High Volume, Advanced Model):
  - Scenario: A research team uses Mixtral-8x7B-Instruct-v0.1 for detailed data analysis and code generation, processing 200 million input tokens and 150 million output tokens per month.
  - Calculation (example rates): ($0.40/M input tokens * 200M) + ($0.40/M output tokens * 150M) = $80.00 + $60.00 = $140.00 per month.
  - Outcome: Higher cost due to increased volume and a more powerful, computationally expensive model.
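The three inference scenarios above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the example rates and volumes from the scenarios; `SCENARIOS` and `monthly_inference_cost` are hypothetical names chosen for illustration.

```python
# (label, input tokens in millions, output tokens in millions,
#  input rate in USD per 1M, output rate in USD per 1M), from the scenarios above.
SCENARIOS = [
    ("Basic chatbot", 1, 1, 0.25, 0.25),
    ("Content generation", 50, 40, 0.20, 0.20),
    ("Complex application", 200, 150, 0.40, 0.40),
]


def monthly_inference_cost(input_m, output_m, input_rate, output_rate):
    """Return the monthly USD cost for token volumes given in millions."""
    return round(input_m * input_rate + output_m * output_rate, 2)


totals = {
    label: monthly_inference_cost(input_m, output_m, input_rate, output_rate)
    for label, input_m, output_m, input_rate, output_rate in SCENARIOS
}

for label, total in totals.items():
    print(f"{label}: ${total:.2f} per month")
# Basic chatbot: $0.50 per month
# Content generation: $18.00 per month
# Complex application: $140.00 per month
```

Keeping input and output rates as separate parameters lets the same function handle models whose input and output prices differ.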
Fine-tuning scenarios
- Initial Model Customization (Small Dataset):
  - Scenario: A startup fine-tunes Llama-2-7B on a small dataset for a specialized task. The training job runs on an A10 GPU for approximately 5 hours.
  - Calculation (example rate): $0.75/hour * 5 hours = $3.75.
  - Outcome: Very affordable for initial customization and iterative development.
- Production-Ready Fine-tuning (Large Dataset):
  - Scenario: An enterprise fine-tunes a larger model like CodeLlama-34B on a substantial proprietary dataset. The training requires an A100 GPU for 24 hours.
  - Calculation (example rate): $1.50/hour * 24 hours = $36.00.
  - Outcome: A higher cost reflecting intensive compute usage for a production-grade model.
How the pricing compares
Together AI positions itself as a cost-effective provider for running and fine-tuning open-source LLMs compared to proprietary models offered by major cloud providers or other specialized LLM platforms. The pricing model generally aims to be competitive, especially for users who prioritize access to a wide array of open-source models and have fluctuating compute demands.
Comparison with Major Cloud Providers (e.g., AWS, Azure, Google Cloud):
- Proprietary Models: Platforms like Google Cloud's Vertex AI or AWS's Bedrock offer access to proprietary models (e.g., Google's Gemini, Anthropic's Claude, Amazon's Titan). These models can sometimes have higher per-token costs, particularly for advanced versions, but may also offer different performance characteristics or specialized capabilities (Google Cloud Vertex AI pricing).
- Managed Services: Major cloud providers often include additional managed service fees on top of token costs, which can increase the overall expenditure. Together AI's pricing is more narrowly focused on the compute and inference units.
- Open-Source Hosting: While major clouds do offer infrastructure for hosting open-source models on dedicated GPUs, users are often responsible for the full operational overhead, including provisioning, scaling, and maintenance. Together AI abstracts much of this complexity, offering a serverless experience at specific token/hourly rates.
Comparison with Other Specialized LLM Platforms (e.g., Anyscale, Fireworks AI):
- Similar Models: Competitors like Anyscale Endpoints and Fireworks AI also focus on providing access to open-source LLMs with pay-as-you-go pricing. Pricing structures are often similar, based on input/output tokens for inference and hourly rates for fine-tuning.
- Feature Set: Differences in pricing might arise from additional features, such as advanced analytics, specific compliance certifications, or dedicated support models. Together AI's SOC 2 Type II compliance (Together AI compliance details) can be a factor for enterprises.
- Model Selection: The specific range and versions of open-source models offered can vary, impacting perceived value. Together AI maintains a broad and frequently updated catalog of models.
In summary, Together AI's pricing is generally competitive for users focused on open-source LLMs, offering a transparent, usage-based model that can be more cost-effective than some alternatives, especially for high-volume inference or extensive fine-tuning projects that benefit from serverless GPU access.