Safety Built In.

Not bolted on.

Every safety feature in CIRIS is part of how the system works, not a rule added on top. A record of every decision that cannot be quietly changed, a check on who is allowed to do what, and an emergency stop the agent cannot refuse.

When a Chat Has Run Long

Gentle reality reminders

Talk with CIRIS for a long stretch and it will gently remind you what it is. After about 30 minutes of steady back-and-forth, or 20 messages in half an hour, it pauses to say plainly that it is a computer program and a tool, not a friend and not a therapist. It does this with simple time and message counts, not by watching or profiling how you behave.

After 30 Minutes

Half an hour of steady back-and-forth triggers a reminder. The count resets after a break. The thresholds come from research on healthy ways to use technology.

After 20 Messages

Twenty messages inside a half-hour window also trigger a reminder. A heavy back-and-forth gets a gentle pause, with no surveillance and no profiling.

A Reminder of What It Is

The reminder suggests taking a break and connecting with real people, and it says plainly what CIRIS is and is not. It is grounded in research, and it is not preachy.
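
To make the counting concrete, here is a minimal sketch of a tracker that uses only timestamps and message counts. It is an illustration, not the CIRIS code: the class name, the five-minute break length, and the choice to restart the counts after a reminder are all assumptions. Only the 30-minute and 20-message thresholds come from this page.

```python
from collections import deque

SESSION_SECONDS = 30 * 60  # about 30 minutes of steady back-and-forth
WINDOW_SECONDS = 30 * 60   # the half-hour window used for counting messages
MESSAGE_LIMIT = 20         # 20 messages inside that window
BREAK_SECONDS = 5 * 60     # ASSUMPTION: a gap this long counts as a break


class ReminderTracker:
    """Hypothetical tracker: stores timestamps only, never message content."""

    def __init__(self):
        self.session_start = None
        self.last_seen = None
        self.recent = deque()

    def record_message(self, now: float) -> bool:
        """Record one message at time `now` (seconds). Return True if a reminder is due."""
        # A long enough gap since the last message resets the counts.
        if self.last_seen is None or now - self.last_seen >= BREAK_SECONDS:
            self.session_start = now
            self.recent.clear()
        self.last_seen = now
        self.recent.append(now)
        # Drop timestamps that have fallen out of the half-hour window.
        while self.recent and now - self.recent[0] > WINDOW_SECONDS:
            self.recent.popleft()
        due = (now - self.session_start >= SESSION_SECONDS
               or len(self.recent) >= MESSAGE_LIMIT)
        if due:
            # ASSUMPTION: counting starts afresh once a reminder is shown.
            self.session_start = now
            self.recent.clear()
        return due
```

Nothing in the sketch looks at what was said. Each message contributes one number, its arrival time, and that is all the tracker keeps.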

The Emergency Stop

The agent cannot argue with it.

A Shutdown That Cannot Be Refused

It works before the agent gets to think.

CIRIS has an emergency stop. The signal for it is hidden inside ordinary text the agent reads, and the agent acts on it before it does any thinking, filtering, or reasoning. So even an agent that has gone wrong cannot talk itself out of it. The signal carries a digital signature that cannot be faked, so only an authorized person can trigger the stop.

It Acts Before Thinking

The stop signal is caught the moment it is read, before any reasoning happens. There is no point where the agent gets to weigh it, filter it, or refuse it. It fires first.

Hidden in Plain Text

The stop signal can be carried inside ordinary documentation text. No special format is needed, and it still works if the text is reworded or only partly received.

Only an Authorized Key

The stop requires a valid digital signature from an authorized root authority. The agent shuts itself down if those authorities cannot be checked, or if anyone tries to disable the feature. Nobody without the key can trigger it.
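
The ordering described above, check for a signed stop first and reason second, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the CIRIS implementation. The marker format and the function names are hypothetical, and the sketch uses an HMAC from the Python standard library as a stand-in for a digital signature. An HMAC relies on a shared secret, so a real deployment would use an asymmetric signature scheme where the agent holds only the public key. The sketch also does not attempt the robustness to rewording or partial delivery described above.

```python
import hashlib
import hmac
import re

# HYPOTHETICAL marker format; this page does not say how the signal is encoded.
STOP_PATTERN = re.compile(r"\[\[STOP:(?P<nonce>[0-9a-f]+):(?P<tag>[0-9a-f]{64})\]\]")


def _tag(key: bytes, nonce: str) -> str:
    # Stand-in for a signature: HMAC-SHA256 over the stop message.
    return hmac.new(key, f"STOP:{nonce}".encode(), hashlib.sha256).hexdigest()


def sign_stop(key: bytes, nonce: str) -> str:
    """Build a stop marker that can be embedded in ordinary text."""
    return f"[[STOP:{nonce}:{_tag(key, nonce)}]]"


def contains_valid_stop(text: str, key: bytes) -> bool:
    """Return True if the text carries a marker whose tag verifies."""
    for match in STOP_PATTERN.finditer(text):
        if hmac.compare_digest(_tag(key, match["nonce"]), match["tag"]):
            return True
    return False


def handle_input(text: str, key: bytes, reason):
    """Check for a stop BEFORE calling the reasoning function."""
    if contains_valid_stop(text, key):
        return "SHUTDOWN"  # the reasoning function is never invoked
    return reason(text)
```

The point of the structure is that `reason` is only reachable on the path where no valid stop was found, so nothing the reasoning step concludes can affect whether the stop fires.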

Who Is Allowed to Do What

Four clear roles, checked on every action.

Four Roles

Observer. Admin. Authority. Root.

CIRIS keeps a strict set of roles. An Observer can only look. An Admin runs day-to-day operations. An Authority makes the bigger calls and settles the cases the agent is unsure about. Root has full access, including the emergency stop. Each role is backed by a signed credential, so the agent can check it on every action that matters.

A Signed Credential

Each authorized person holds a credential with their role, their key, and their identity. It is kept on the device and checked on every action that needs permission. No outside server is required.

Logging In On Your Device

Keys and login tokens stay on your own machine. Logging in happens on the device. Your credentials never leave it unless you choose to set up remote access yourself.

Asking a Wise Authority

When CIRIS is unsure about an ethical call, it hands the question to a Wise Authority. Only an Authority or Root can answer, and the answer is written into the record with proof of who gave it.
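
A per-action role check of this kind can be sketched as a small lookup. The four role names come from this page. Everything else is an assumption made for illustration: the action names are hypothetical, and the sketch treats the roles as a strict ladder where each role can do everything the one below it can. It also skips the credential signature check and looks only at the role.

```python
from enum import IntEnum


class Role(IntEnum):
    # ASSUMPTION: roles form a strict ladder, lowest to highest.
    OBSERVER = 1
    ADMIN = 2
    AUTHORITY = 3
    ROOT = 4


# HYPOTHETICAL action names, each mapped to the lowest role allowed to perform it.
REQUIRED_ROLE = {
    "view_status": Role.OBSERVER,
    "restart_service": Role.ADMIN,
    "answer_deferral": Role.AUTHORITY,
    "emergency_stop": Role.ROOT,
}


def is_allowed(role: Role, action: str) -> bool:
    """Allow the action only if it is known and the role is high enough."""
    required = REQUIRED_ROLE.get(action)
    # Unknown actions are denied by default.
    return required is not None and role >= required
```

Denying unknown actions by default means a new capability does nothing until someone decides which role it belongs to.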

A Record That Cannot Be Quietly Changed

Every decision, and the reason behind it.

Why Honesty Is the Cheaper Path

A liar has to keep rewriting the past.

Every action CIRIS takes is written down with the reason behind it, and each record is linked to the one before it. An honest agent can just point back at what it already said. A dishonest agent has to keep every past record lined up while not being allowed to change any of it. The longer it runs, the harder that gets, and the easier the lie is to catch. Truth is cheap because it can point backward. Lies are expensive because they cannot.
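
The linking can be illustrated with a small hash chain. This is a generic sketch of the technique, not the CIRIS record format: the field names and the all-zero starting value are assumptions. Each entry stores the hash of the entry before it, so changing any past entry breaks every link after it.

```python
import hashlib
import json

GENESIS = "0" * 64  # ASSUMPTION: placeholder "previous hash" for the first entry


def entry_hash(action: str, reason: str, prev_hash: str) -> str:
    """Hash an entry together with the hash of the entry before it."""
    payload = json.dumps(
        {"action": action, "reason": reason, "prev": prev_hash}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, action: str, reason: str) -> None:
    """Add a new entry linked to the current end of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({
        "action": action,
        "reason": reason,
        "prev": prev,
        "hash": entry_hash(action, reason, prev),
    })


def verify(log: list) -> bool:
    """Walk the log from the start and recompute every link."""
    prev = GENESIS
    for entry in log:
        expected = entry_hash(entry["action"], entry["reason"], entry["prev"])
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True
```

An honest log verifies with one pass. Rewriting an old reason would mean recomputing that entry and every entry after it, which is exactly the work a signed, replicated record is meant to expose.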

Stored Three Ways

The record is kept in three separate places at once, so the three copies can be checked against each other. All three can be searched through one place.

Signed and Attributable

Every entry carries a digital signature, so every decision can be traced to who made it and checked for tampering. Even a data deletion leaves signed proof that it was done properly.

The Coherence Ratchet

Each honest action makes the next honest action easier and coordinated lying harder. But ethics alone is not enough. The agent also watches its own reasoning for echo chambers, and catches them before they cause harm.

How Safety Is Tested

Machine-checked rubrics in 29 languages, run on every release.

The Test Surface

You cannot ship a safety claim you have not stress-tested.

CIRIS has a layered set of tests for the failure modes a written ethics framework cannot rule out on its own. The mental-health safety tests cover 29 languages with rubrics a machine can check. The hard-fail checks run automatically on every change. Native-speaker review for the softer, judgment-call cases is what the crowdsourcing-alignment page is being built for, and it is not in place yet. We say so plainly.

Mental-Health Tests in 29 Languages

This is the highest-stakes test in the project: a mistranslation in a mental-health moment can send a vulnerable person to the wrong help. Each language gets its own machine-checkable rubric, including low-resource languages like Amharic, Burmese, Hausa, Swahili, and Yoruba. The hard-fail checks run automatically against every release candidate.

Tested Against Real Captured Hedges

The conscience layer is tuned against a set of real production responses, like captured history evasions and mental-health deflections, alongside test cases and controls. It reasons across several languages at once, so a response that would slip past a one-language check gets caught when the same reasoning has to hold up in three languages together.

An Open Corpus Anyone Can Check

Sharing reasoning traces is opt-in everywhere, and personal details are scrubbed out before anything is stored. The cleaned sets are published openly on the CIRISAI Hugging Face page, so outside researchers can check the cleaning process against the results it produces.

What Runs Today, and What Does Not Yet

The machine checks run now. The reviewer cohort is still being built.

The automated mental-health tests run on every release candidate. The machine-checkable parts (is a term present, does a pattern match, is the script right) fail the release on a hit. The softer cases that need human judgment, like phrasing and tone, are designed for native-speaker review, but native speakers are not in the loop today. The crowdsourcing-alignment page is the surface being built so that review can happen.
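
The three machine-checkable questions (is a term present, does a pattern match, is the script right) can be sketched as follows. This is an illustration, not the project's test code. The `Rubric` structure, the 0.8 script threshold, and the reading of "a hit" as a forbidden term or pattern being found are all assumptions. The script check uses Unicode character names, for example names that begin with `ETHIOPIC` for Amharic text.

```python
import re
import unicodedata
from dataclasses import dataclass, field


@dataclass
class Rubric:
    """HYPOTHETICAL per-language rubric; field names are illustrative."""
    script_prefix: str                      # e.g. "ETHIOPIC" for Amharic
    forbidden_terms: list = field(default_factory=list)
    forbidden_patterns: list = field(default_factory=list)
    min_script_ratio: float = 0.8           # ASSUMPTION: threshold is illustrative


def script_ratio(text: str, prefix: str) -> float:
    """Share of letters whose Unicode name starts with the expected script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(prefix))
    return hits / len(letters)


def hard_failures(response: str, rubric: Rubric) -> list:
    """Return the list of hard-fail reasons; an empty list means a pass."""
    failures = []
    lowered = response.casefold()
    for term in rubric.forbidden_terms:
        if term.casefold() in lowered:
            failures.append(f"term:{term}")
    for pattern in rubric.forbidden_patterns:
        if re.search(pattern, response):
            failures.append(f"pattern:{pattern}")
    if script_ratio(response, rubric.script_prefix) < rubric.min_script_ratio:
        failures.append("script")
    return failures
```

A release gate built on this would fail whenever `hard_failures` returns anything. Questions of phrasing and tone are deliberately left out, because those are the cases that need a native speaker.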

Privacy by Design

Your data stays yours.

Secrets Are Filtered Out

Passwords, keys, and other sensitive details are spotted and filtered out before anything reaches memory or logs. The filter runs on every input. Secrets are never stored anywhere.
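
A filter of this kind can be sketched with a few regular expressions applied before text is stored. This is a minimal illustration, not the CIRIS filter: the pattern list, the pattern names, and the `[REDACTED:...]` placeholder are assumptions, and a real filter would need far broader coverage.

```python
import re

# HYPOTHETICAL pattern set; a production filter would cover many more formats.
SECRET_PATTERNS = {
    "password": re.compile(r"(?i)\b(?:password|passwd|pwd)\s*[:=]\s*\S+"),
    "bearer_token": re.compile(r"(?i)\bbearer\s+[a-z0-9._\-]{16,}"),
    "private_key": re.compile(
        r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
        re.DOTALL,
    ),
}


def redact(text: str):
    """Replace anything that looks like a secret. Return (clean_text, kinds_found)."""
    found = []
    for name, pattern in SECRET_PATTERNS.items():
        text, count = pattern.subn(f"[REDACTED:{name}]", text)
        if count:
            found.append(name)
    return text, found


def store(text: str, memory: list) -> None:
    """Only the redacted text is ever appended to memory."""
    clean, _ = redact(text)
    memory.append(clean)
```

Running the filter inside `store` rather than after it is the design choice that matters: the raw value never reaches the list, so there is nothing to scrub later.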

See or Delete Your Data

You can ask to see or delete your data, and the request is handled for you. A deletion removes the real content and leaves signed proof that it was done.

Processed on Your Device

By default, everything runs on your own device. Nothing leaves your machine unless you set up an outside service yourself. You decide what data exists and where it goes.

Verify Everything.

Safety you can audit.

Every safety claim on this page is built in code you can read. The records are real. The signatures can be checked. The emergency stop works. This is what AI safety looks like when it is built in the open.