CIRIS Agent runs on a small set of open models that meet five non-negotiable criteria. The current production lineup is Llama 4 Maverick, Llama 4 Scout, Qwen 3.6, and Gemma 4 — chosen for different roles in the agent's workload.
Llama 4 Maverick is the reasoning workhorse: a Mixture-of-Experts model with ~17B active parameters per token, a 1M-token context window, and multi-provider availability. It is the default for the deeper reasoning steps where context capacity matters most.
Providers: OpenRouter, Groq, Together
Llama 4 Scout is the fast companion in the Llama 4 family: smaller and quicker than Maverick, with strong tool calling. It is used for interactive tiers where latency matters and the full Maverick context budget is not required.
Providers: OpenRouter, Groq
Qwen 3.6 brings multilingual depth and strong structured output. It carries weight in the non-English reasoning paths the polyglot Accord requires, and its independent provider base outside the Llama family adds redundancy in the fallback chain.
Providers: OpenRouter, DashScope
Gemma 4 is small enough to run on commodity hardware. It is used where reach matters more than raw capacity (on-device, low-bandwidth, and degraded-network deployments) and as a third-family fallback alongside Llama and Qwen.
Providers: OpenRouter, Google
CIRIS includes the complete Accord and Comprehensive Guide in every single LLM call. No compression, no summaries, no options. This means the agent never forgets its obligations—not even for a single token. That's why context window isn't a vanity metric for us: it's a direct extension of our commitment to transparency and accountability.
Models must natively support function calling and return valid JSON across 12-70 tool calls per interaction. CIRIS is an orchestrator, so we need stable tool semantics, not chatty conversation.
CIRIS embeds the full Accord and Guide into every prompt. A 128K-token context window is the absolute minimum; 256K+ is strongly preferred for long conversations, tool outputs, and audit trails.
Target: <$1.00 per 1M tokens combined. We choose the cheapest working option—not the cheapest benchmark winner. A reliable model that never breaks JSON beats a cheaper model that fails 1 in 10 calls.
Models must be available from at least two independent providers for robust fallback chains. CIRIS degrades gracefully during outages instead of failing hard.
Fast responses keep humans in the loop for ethical review workflows. We prioritize low-latency providers for interactive tiers while accepting slower backends for background tasks.
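The five criteria above can be read as a simple pass/fail screen. The following is a minimal sketch, not the CIRIS evaluation code: the `Candidate` record, its field names, and the `failed_criteria` helper are hypothetical, and the latency budget is left to the caller because the text gives no numeric target for it. The context, price, and provider thresholds are the ones stated above.

```python
from __future__ import annotations

from dataclasses import dataclass

# Thresholds taken from the criteria above.
MIN_CONTEXT_TOKENS = 128_000     # "128K is the absolute minimum"
MAX_PRICE_PER_M_TOKENS = 1.00    # "<$1.00 per 1M tokens combined"
MIN_PROVIDERS = 2                # "at least two independent providers"


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record describing a model under evaluation."""
    name: str
    native_tool_calling: bool
    context_tokens: int
    price_per_m_tokens: float    # combined USD price, as measured by the evaluator
    providers: tuple[str, ...]
    latency_s: float             # measured latency for the tier being tested


def failed_criteria(c: Candidate, latency_budget_s: float) -> list[str]:
    """Return the criteria the candidate fails; an empty list means it passes."""
    failures = []
    if not c.native_tool_calling:
        failures.append("tool calling")
    if c.context_tokens < MIN_CONTEXT_TOKENS:
        failures.append("context window")
    if c.price_per_m_tokens >= MAX_PRICE_PER_M_TOKENS:
        failures.append("cost")
    if len(set(c.providers)) < MIN_PROVIDERS:
        failures.append("provider redundancy")
    if c.latency_s > latency_budget_s:
        failures.append("latency")
    return failures
```

Returning the list of failed criteria, rather than a single boolean, keeps the reason for an exclusion visible, which fits the accountability emphasis of the rest of this page.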
Llama 4 Maverick via a cost-optimized provider, for deep-reasoning steps that need the full context budget.
Llama 4 Scout via a speed-optimized provider (Groq) for interactive use, with Maverick on Groq as a heavier-context alternative.
Qwen 3.6 carries weight in non-English reasoning paths the polyglot Accord requires, and provides a non-Llama fallback in the chain.
Gemma 4 for on-device, low-bandwidth, and degraded-network deployments where reaching the user matters more than the size of the model.
Maverick → Scout → Qwen 3.6 → Gemma 4 across multiple providers, so the agent degrades gracefully across model families and infrastructure boundaries instead of failing hard.
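One way to picture the fallback chain is as an ordered walk over (model, provider) pairs. This is a minimal sketch under stated assumptions: the model identifier strings are placeholders rather than real provider model IDs, the order of providers within each model is assumed, and `send` is a hypothetical caller-supplied function standing in for whatever client actually issues the request.

```python
from typing import Callable

# Model order and provider names as listed on this page. The identifier
# strings and the provider order within each model are assumptions.
FALLBACK_CHAIN: list[tuple[str, list[str]]] = [
    ("llama-4-maverick", ["OpenRouter", "Groq", "Together"]),
    ("llama-4-scout", ["OpenRouter", "Groq"]),
    ("qwen-3.6", ["OpenRouter", "DashScope"]),
    ("gemma-4", ["OpenRouter", "Google"]),
]


class AllRoutesFailed(RuntimeError):
    """Raised when every (model, provider) pair in the chain has failed."""


def complete_with_fallback(
    prompt: str,
    send: Callable[[str, str, str], str],
) -> tuple[str, str, str]:
    """Try each (model, provider) pair in order; return (model, provider, reply)."""
    errors = []
    for model, providers in FALLBACK_CHAIN:
        for provider in providers:
            try:
                return model, provider, send(model, provider, prompt)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append(f"{model}@{provider}: {exc}")
    raise AllRoutesFailed("; ".join(errors))
```

Exhausting every provider for one model before moving to the next model is a design choice in this sketch; a chain that prefers staying on one provider and switching models first would be equally consistent with the description above.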
Maverick handles deep reasoning where the full context budget matters. Scout carries the interactive tier where latency dominates. Qwen 3.6 reaches the polyglot reasoning paths the Accord requires across 29 languages. Gemma 4 is the small-footprint option that puts the agent within reach of commodity hardware. The lineup is chosen so different tiers of work go to the model that actually fits, instead of forcing one model to do everything.
Llama (Maverick + Scout), Qwen, and Gemma come from three independent training pipelines and three independent provider ecosystems. That matters for fallback chains: a CVE, a licensing change, or a provider outage on one family does not take the agent down. Independence at the model layer is the same property that makes the IDMA component robust at the reasoning layer.
Excluded from production: models that cannot meet the five criteria. Most often these are models that look attractive on token price but fail on structured output and tool calling.
Representative failure mode (GPT-OSS-20B): "tool choice is required, but the model did not call a tool"
This error is unacceptable for a framework that depends on 12-70 tool calls per interaction. Even a token price 3-10× lower is not worth the operational failures.
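The arithmetic behind this is worth making explicit. As a rough sketch, assume each tool call fails independently with the same probability (a simplifying assumption, not a measured property of any model); the chance that an interaction hits at least one failure is then 1 - (1 - p)^n. The helper name `p_any_failure` is hypothetical, and the 1-in-10 rate is the example used in the cost criterion above.

```python
def p_any_failure(per_call_failure_rate: float, tool_calls: int) -> float:
    """Probability that at least one of `tool_calls` independent calls fails."""
    return 1.0 - (1.0 - per_call_failure_rate) ** tool_calls


# 12 and 70 are the two ends of the stated tool-call range.
for n in (12, 70):
    chance = p_any_failure(0.10, n)
    print(f"{n} tool calls: {chance:.1%} chance of at least one failure")
```

Under these assumptions, a 1-in-10 per-call failure rate gives roughly a 72% chance of at least one failure across 12 tool calls and over 99% across 70, which is why per-call reliability outweighs token price here.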
CIRIS embeds the full Accord and complete Comprehensive Guide into every prompt. Not a summary. Not a distilled version. The entire governance text.
This ensures that updates to the Accord or Guide immediately affect behavior across all agents, without waiting for new fine-tunes or prompt compression strategies.
CIRIS Agents are tool-heavy orchestrators juggling:
This combined context easily exceeds 32K-64K tokens, especially for long-running sessions or complex investigations. That's why a 128K-token window is the minimum and 256K+ is preferred.
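The budget reasoning can be sketched as a simple fit check. Everything numeric in the usage lines below is an invented placeholder for illustration, not a measured size of the Accord, the Guide, or any real session; `fits_window` and its `reply_reserve` parameter are likewise hypothetical.

```python
def fits_window(
    window_tokens: int,
    governance_tokens: int,
    session_tokens: int,
    reply_reserve: int = 4_000,
) -> bool:
    """True if the fixed governance text, the session context, and a reserve
    for the model's reply all fit inside the context window."""
    return governance_tokens + session_tokens + reply_reserve <= window_tokens


# Placeholder sizes, chosen only to illustrate the comparison.
GOVERNANCE = 40_000   # hypothetical size of the always-included governance text
SESSION = 45_000      # hypothetical history, tool outputs, and audit trail

print(fits_window(64_000, GOVERNANCE, SESSION))    # 64K window
print(fits_window(128_000, GOVERNANCE, SESSION))   # 128K window
```

Because the governance text is never trimmed, the only adjustable terms are the session context and the window itself, so the window size is what the model selection has to guarantee.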
The bottom line:
CIRIS does not trim its values or procedures to fit the model. Instead, CIRIS chooses models that are large enough to carry the entire ethical and operational framework on every call. Models with context windows below the 128K minimum, even if cheaper or more popular, are excluded from production use.
CIRIS runs Llama 4 Maverick, Llama 4 Scout, Qwen 3.6, and Gemma 4 in production because together they satisfy the operational and economic constraints the Accord imposes — long context, reliable tool calling, polyglot coverage, and reach to commodity hardware — across three independent model families. New models are monitored and tested continuously; the lineup changes when something better actually meets the five criteria.
This is not about chasing benchmark scores or following hype cycles. It is about choosing models that actually work for accountable, tool-centric agents in production — and that take the Accord seriously enough to carry it in every single call.