InstallComparePlatformAccordGitHub

What LLMs We Use and Why

CIRIS Agent runs on a small set of open models that meet five non-negotiable criteria. The current production lineup is Llama 4 Maverick, Llama 4 Scout, Qwen 3.6, and Gemma 4 — chosen for different roles in the agent's workload.

Llama 4 Maverick

The reasoning workhorse. Mixture-of-Experts with ~17B active parameters per token, 1M token context, multi-provider availability. Default for the deeper reasoning steps where context capacity matters most.

Providers: OpenRouter, Groq, Together

Llama 4 Scout

The fast companion in the Llama 4 family. Smaller and quicker than Maverick, with strong tool calling. Used for interactive tiers where latency matters and the full Maverick context budget is not required.

Providers: OpenRouter, Groq

Qwen 3.6

Multilingual depth and strong structured output. Carries weight in the non-English reasoning paths the polyglot Accord requires; an independent provider base outside the Llama family adds redundancy in the fallback chain.

Providers: OpenRouter, DashScope

Gemma 4

Small enough to run on commodity hardware. Used where reach matters more than raw capacity — on-device, low-bandwidth, and degraded-network deployments — and as a third-family fallback alongside Llama and Qwen.

Providers: OpenRouter, Google

CIRIS doesn't trim its values to fit the model

We choose models large enough to carry our entire ethical framework.

CIRIS includes the complete Accord and Comprehensive Guide in every single LLM call. No compression, no summaries, no options. This means the agent never forgets its obligations—not even for a single token. That's why context window isn't a vanity metric for us: it's a direct extension of our commitment to transparency and accountability.

Our Model Criteria

Five non-negotiable requirements for CIRIS Agent

1. Structured Outputs & Tool Use

Must natively support function calling and return valid JSON across 12-70 tool calls per interaction. CIRIS is an orchestrator—we need stable tool semantics, not chatty conversation.

2. Context Window: 128K Minimum

CIRIS embeds the full Accord and Guide into every prompt. 128K is the absolute minimum; 256K+ is strongly preferred for long conversations, tool outputs, and audit trails.

3. Cost Efficiency

Target: <$1.00 per 1M tokens combined. We choose the cheapest working option—not the cheapest benchmark winner. A reliable model that never breaks JSON beats a cheaper model that fails 1 in 10 calls.

4. Multi-Provider Availability

Must be available from at least two independent providers for robust fallback chains. CIRIS degrades gracefully during outages instead of failing hard.

5. Latency & User Experience

Fast responses keep humans in the loop for ethical review workflows. We prioritize low-latency providers for interactive tiers while accepting slower backends for background tasks.

Production Deployment

Default Tier

Llama 4 Maverick via cost-optimized provider for deep-reasoning steps that need the full context budget.

Fast Tier

Llama 4 Scout via speed-optimized provider (Groq) for interactive use, with Maverick on Groq as a heavier-context alternative.

Multilingual Tier

Qwen 3.6 carries weight in non-English reasoning paths the polyglot Accord requires, and provides a non-Llama fallback in the chain.

Edge Tier

Gemma 4 for on-device, low-bandwidth, and degraded-network deployments where reaching the user matters more than the size of the model.

Fallback Chain

Maverick → Scout → Qwen 3.6 → Gemma 4 across multiple providers, so the agent degrades gracefully across model families and infrastructure boundaries instead of failing hard.

Why This Lineup

Different roles, not interchangeable parts

Maverick handles deep reasoning where the full context budget matters. Scout carries the interactive tier where latency dominates. Qwen 3.6 reaches the polyglot reasoning paths the Accord requires across 29 languages. Gemma 4 is the small-footprint option that puts the agent within reach of commodity hardware. The lineup is chosen so different tiers of work go to the model that actually fits, instead of forcing one model to do everything.

Three independent model families

Llama (Maverick + Scout), Qwen, and Gemma come from three independent training pipelines and three independent provider ecosystems. That matters for fallback chains: a CVE, a licensing change, or a provider outage on one family does not take the agent down. Independence at the model layer is the same property that makes the IDMA component robust at the reasoning layer.

What stays out of the lineup

Models that cannot meet the five criteria — most often, models that look attractive on token price but fail on structured output and tool calling.

Representative failure mode (GPT-OSS-20B): "tool choice is required, but the model did not call a tool"

This error is unacceptable for a framework that depends on 12-70 tool calls per interaction. Even a 3-10× cheaper token price is not worth the operational failures.

Why 128K+ Context Is Non-Negotiable

Always-On Accord and Guide

CIRIS embeds the full Accord and complete Comprehensive Guide into every prompt. Not a summary. Not a distilled version. The entire governance text.

This ensures that updates to the Accord or Guide immediately affect behavior across all agents, without waiting for new fine-tunes or prompt compression strategies.

Full Ethical and Procedural State

CIRIS Agents are tool-heavy orchestrators juggling:

  • Multi-step workflows
  • System state and tool outputs
  • User messages and conversation history
  • The complete Accord and Guide

This combined context easily exceeds 32K-64K, especially for long-running sessions or complex investigations. That's why 128K is the minimum and 256K+ is preferred.

The bottom line:

CIRIS does not trim its values or procedures to fit the model. Instead, CIRIS chooses models that are large enough to carry the entire ethical and operational framework on every call. Models with smaller context windows—even if cheaper or more popular—are excluded from production use.

How This Supports the CIRIS Accord

Model selection as ethical infrastructure

Transparency and Inspectability

  • Long context keeps reasoning traces, decisions, and tool calls visible for human review
  • Stable JSON and structured outputs make every tool invocation auditable
  • Complete governance artifacts in every call ensure decisions are traceable back to principles

Resilience and Governance

  • Multi-provider deployment avoids single points of failure in critical ethical infrastructure
  • Choosing "good enough and reliable" over "flashy but fragile" prioritizes safety and continuity
  • Graceful degradation during outages maintains service availability

Human Oversight

  • Fast tiers keep humans comfortably in the loop for real-time ethical review
  • Cheaper tiers enable extensive background analysis without prohibitive costs
  • Balanced approach supports both day-to-day use and periodic governance audits

The Bottom Line

CIRIS runs Llama 4 Maverick, Llama 4 Scout, Qwen 3.6, and Gemma 4 in production because together they satisfy the operational and economic constraints the Accord imposes — long context, reliable tool calling, polyglot coverage, and reach to commodity hardware — across three independent model families. New models are monitored and tested continuously; the lineup changes when something better actually meets the five criteria.

This is not about chasing benchmark scores or following hype cycles. It is about choosing models that actually work for accountable, tool-centric agents in production — and that take the Accord seriously enough to carry it in every single call.