
What LLM We Use and Why

CIRIS Agent runs on Llama 4 Maverick because it is currently the most reliable open model we have found for ethical, tool-heavy agents in production.

Llama 4 Maverick

Architecture: Mixture-of-Experts (~400B total parameters, ~17B activated per token)

Context Window: 1M tokens

Pricing: ~$0.11 in / $0.34 out per 1M tokens (via OpenRouter)

Deployment: Multi-provider (OpenRouter, Groq, Together)

CIRIS doesn't trim its values to fit the model

We choose models large enough to carry our entire ethical framework.

CIRIS includes the complete Covenant and Comprehensive Guide in every single LLM call. No compression, no summaries, no opt-outs. This means the agent never forgets its obligations, not even for a single token. That's why context window isn't a vanity metric for us: it's a direct extension of our commitment to transparency and accountability.

Our Model Criteria

Five non-negotiable requirements for CIRIS Agent

1. Structured Outputs & Tool Use

Must natively support function calling and return valid JSON across 12-70 tool calls per interaction. CIRIS is an orchestrator—we need stable tool semantics, not chatty conversation.
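One way to make "stable tool semantics" concrete is to validate every response before acting on it. The sketch below assumes OpenAI-style `tool_calls` messages (the shape returned by OpenRouter-compatible providers); `validate_tool_response` is a hypothetical helper, not part of CIRIS itself:

```python
import json

def validate_tool_response(message: dict, allowed_tools: set) -> list:
    """Check that an OpenAI-style assistant message contains only
    well-formed tool calls whose arguments parse as valid JSON.
    Returns the parsed calls, or raises ValueError on any failure."""
    calls = message.get("tool_calls") or []
    if not calls:
        # The classic failure mode: tool use was required,
        # but the model answered with plain text instead.
        raise ValueError("model did not call a tool")
    parsed = []
    for call in calls:
        fn = call["function"]
        if fn["name"] not in allowed_tools:
            raise ValueError(f"unknown tool: {fn['name']}")
        # json.loads raises if the model emitted malformed arguments
        parsed.append((fn["name"], json.loads(fn["arguments"])))
    return parsed

# A well-formed response passes validation:
ok = {"tool_calls": [{"function": {"name": "read_file",
                                   "arguments": '{"path": "covenant.md"}'}}]}
print(validate_tool_response(ok, {"read_file"}))
```

Running every one of the 12-70 calls through a gate like this is what turns "mostly works" into an auditable guarantee.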

2. Context Window: 128K Minimum

CIRIS embeds the full Covenant and Guide into every prompt. 128K is the absolute minimum; 256K+ is strongly preferred for long conversations, tool outputs, and audit trails.

3. Cost Efficiency

Target: <$1.00 per 1M tokens combined. We choose the cheapest working option—not the cheapest benchmark winner. A reliable model that never breaks JSON beats a cheaper model that fails 1 in 10 calls.
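To see how the target plays out, here is a back-of-the-envelope calculation using the OpenRouter prices quoted above. The token counts are illustrative assumptions, not measured CIRIS figures:

```python
# Maverick's quoted OpenRouter pricing, converted to $ per token.
PRICE_IN = 0.11 / 1_000_000   # $0.11 per 1M input tokens
PRICE_OUT = 0.34 / 1_000_000  # $0.34 per 1M output tokens

def interaction_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one LLM call at the prices above."""
    return tokens_in * PRICE_IN + tokens_out * PRICE_OUT

# A heavy call: 60K tokens of Covenant + Guide + context in, 2K out.
cost = interaction_cost(60_000, 2_000)
print(f"${cost:.4f} per call")  # well under one cent
```

Even a heavyweight prompt carrying the full governance text costs a fraction of a cent, which is what makes the always-on Covenant economically viable.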

4. Multi-Provider Availability

Must be available from at least two independent providers for robust fallback chains. CIRIS degrades gracefully during outages instead of failing hard.
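A minimal sketch of such a fallback chain, with stub functions standing in for real OpenRouter/Groq/Together clients (the names and behavior here are hypothetical):

```python
def complete_with_fallback(prompt: str, providers: list) -> str:
    """Try each (name, call) provider in order; return the first
    successful completion instead of failing hard when one is down."""
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as exc:  # outage, rate limit, timeout, ...
            errors.append((name, repr(exc)))
    raise RuntimeError(f"all providers failed: {errors}")

# Stub providers simulating one outage and one healthy backend.
def flaky(prompt):   raise TimeoutError("provider outage")
def healthy(prompt): return f"ok: {prompt}"

chain = [("groq", flaky), ("openrouter", healthy)]
print(complete_with_fallback("hello", chain))  # falls through to openrouter
```

Because the same model is served by independent providers, the fallback changes only the backend, never the model's behavior.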

5. Latency & User Experience

Fast responses keep humans in the loop for ethical review workflows. We prioritize low-latency providers for interactive tiers while accepting slower backends for background tasks.

Production Deployment

Default Tier

Llama 4 Maverick via cost-optimized provider (OpenRouter)

Fast Tier

Llama 4 Maverick via speed-optimized provider (Groq) for interactive use

Fallback Chain

Maverick across multiple providers, with final fallback to large Llama 3.3-class models when Maverick is unavailable

Why Maverick Over Other Open Models

GPT-OSS-20B

Attractive on paper due to cost, but weak in structured output and tool calling.

Failure mode: "tool choice is required, but the model did not call a tool"

This error is unacceptable for a framework that depends on 12-70 tool calls per interaction. Even a 3-10× cheaper token price is not worth the operational failures.

Other Long-Context "Max" Models

Some newer high-context, high-parameter models offer impressive benchmarks and large contexts, but with significantly higher per-token pricing and less mature tool-calling behavior. For CIRIS's mission—ethical, inspectable, tool-centric agents—these models are currently better suited to targeted experiments than to default production use.

Maverick's Sweet Spot

Llama 4 Maverick via cost-optimized providers delivers:

  • Reliable tool calling across 12-70 calls per interaction
  • 1M token context (8× the minimum, plenty of headroom)
  • Multi-provider availability (OpenRouter, Groq, Together)
  • Cost significantly below proprietary frontier models

Why 128K+ Context Is Non-Negotiable

Always-On Covenant and Guide

CIRIS embeds the full Covenant and complete Comprehensive Guide into every prompt. Not a summary. Not a distilled version. The entire governance text.

This ensures that updates to the Covenant or Guide immediately affect behavior across all agents, without waiting for new fine-tunes or prompt compression strategies.

Full Ethical and Procedural State

CIRIS Agents are tool-heavy orchestrators juggling:

  • Multi-step workflows
  • System state and tool outputs
  • User messages and conversation history
  • The complete Covenant and Guide

This combined context easily exceeds 32K-64K, especially for long-running sessions or complex investigations. That's why 128K is the minimum and 256K+ is preferred.
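As a rough illustration of why 64K is not enough, consider an assumed per-call budget. The component sizes below are hypothetical, chosen only to show the arithmetic, not measured from CIRIS:

```python
# Illustrative token budget for a single CIRIS call.
budget = {
    "covenant_and_guide": 40_000,       # full governance text
    "conversation_history": 20_000,
    "tool_outputs": 15_000,
    "system_and_workflow_state": 5_000,
}
total = sum(budget.values())
print(total)             # 80000 -- already past a 64K window
print(total <= 128_000)  # True: fits within the stated 128K minimum
```

Under these assumptions a routine call already overflows 64K, and a long investigation with larger history and tool outputs is what pushes the preference to 256K+.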

The bottom line:

CIRIS does not trim its values or procedures to fit the model. Instead, CIRIS chooses models that are large enough to carry the entire ethical and operational framework on every call. Models with smaller context windows—even if cheaper or more popular—are excluded from production use.

How This Supports the CIRIS Covenant

Model selection as ethical infrastructure

Transparency and Inspectability

  • Long context keeps reasoning traces, decisions, and tool calls visible for human review
  • Stable JSON and structured outputs make every tool invocation auditable
  • Complete governance artifacts in every call ensure decisions are traceable back to principles

Resilience and Governance

  • Multi-provider deployment avoids single points of failure in critical ethical infrastructure
  • Choosing "good enough and reliable" over "flashy but fragile" prioritizes safety and continuity
  • Graceful degradation during outages maintains service availability

Human Oversight

  • Fast tiers keep humans comfortably in the loop for real-time ethical review
  • Cheaper tiers enable extensive background analysis without prohibitive costs
  • Balanced approach supports both day-to-day use and periodic governance audits

The Bottom Line

CIRIS uses Llama 4 Maverick as the primary model because it is the most reliable open option that satisfies CIRIS's ethical, operational, and economic constraints. Other models are monitored and periodically tested, but Maverick is the current default because it best serves CIRIS's commitment to trustworthy, tool-centric AI systems.

This isn't about chasing benchmark scores or following hype cycles. It's about choosing a model that actually works for ethical agents in production, and about taking the Covenant seriously enough to carry it in every single call.