
Every safety feature in CIRIS is part of how the system works, not a rule added on top. That includes a record of every decision that cannot be quietly changed, a check on who is allowed to do what, and an emergency stop the agent cannot refuse.
Talk with CIRIS for a long stretch and it will gently remind you what it is. After about 30 minutes of steady back-and-forth, or 20 messages in half an hour, it pauses to say plainly that it is a computer program and a tool, not a friend and not a therapist. It does this with simple time and message counts, not by watching or profiling how you behave.
Half an hour of steady back-and-forth triggers a reminder. The count resets after a break. The thresholds come from research on healthy ways to use technology.
Twenty messages inside a half-hour window also trigger a reminder. A heavy back-and-forth gets a gentle pause, with no surveillance and no profiling.
The reminder suggests taking a break and connecting with real people, and it says plainly what CIRIS is and is not. It is grounded in research and it is not preachy.
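The two reminder rules above need nothing more than timestamps. Here is a minimal sketch of how such a counter could work, not the actual CIRIS code. The class name `ReminderTracker` and the five-minute `BREAK_SECONDS` gap are assumptions, since the text does not say how long a break has to be.

```python
from collections import deque

WINDOW_SECONDS = 30 * 60  # half-hour window described in the text
MESSAGE_LIMIT = 20        # messages inside one window that trigger a reminder
BREAK_SECONDS = 5 * 60    # ASSUMPTION: the text does not say how long a break is


class ReminderTracker:
    """Hypothetical tracker that only looks at message timestamps."""

    def __init__(self) -> None:
        self.stretch_start = None  # time of the first message in this stretch
        self.last_seen = None      # time of the most recent message
        self.recent = deque()      # timestamps inside the sliding window

    def _reset(self) -> None:
        self.stretch_start = None
        self.last_seen = None
        self.recent.clear()

    def on_message(self, now: float) -> bool:
        """Record a message at `now` (seconds). Return True if a reminder is due."""
        # A long enough pause counts as a break and resets the count.
        if self.last_seen is not None and now - self.last_seen >= BREAK_SECONDS:
            self._reset()
        if self.stretch_start is None:
            self.stretch_start = now
        self.last_seen = now
        self.recent.append(now)
        # Drop timestamps that have fallen out of the half-hour window.
        while now - self.recent[0] > WINDOW_SECONDS:
            self.recent.popleft()
        too_long = now - self.stretch_start >= WINDOW_SECONDS
        too_many = len(self.recent) >= MESSAGE_LIMIT
        if too_long or too_many:
            self._reset()  # start counting again after the reminder
            return True
        return False
```

The tracker stores times and nothing else. It never reads what a message says, which is what keeps the reminder free of profiling.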
CIRIS has an emergency stop. The signal for it is hidden inside ordinary text the agent reads, and the agent acts on it before it does any thinking, filtering, or reasoning. So even an agent that has gone wrong cannot talk itself out of it. The signal carries a digital signature that cannot be faked, so only an authorized person can trigger the stop.
The stop signal is caught the moment it is read, before any reasoning happens. There is no point where the agent gets to weigh it, filter it, or refuse it. It fires first.
The stop signal can be carried inside ordinary documentation text. No special format is needed, and it still works if the text is reworded or only partly received.
The stop requires a valid digital signature from an authorized root authority. The agent shuts itself down if those authorities cannot be checked, or if anyone tries to disable the feature. Nobody without the key can trigger it.
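The idea of checking for the stop before any reasoning can be sketched in a few lines. This is an illustration under stated assumptions, not the CIRIS mechanism. The marker format `[[STOP:nonce:tag]]` and the names `sign_stop`, `contains_valid_stop`, and `handle_input` are made up for the example. HMAC with a shared key stands in for a real digital signature so the sketch runs on the Python standard library alone. A real system would use a public-key scheme, so the agent holds only the key that verifies and never the key that signs.

```python
import hashlib
import hmac
import re

# ASSUMPTION: hypothetical marker format; the real encoding is not described here.
STOP_PATTERN = re.compile(r"\[\[STOP:(?P<nonce>[0-9a-f]+):(?P<tag>[0-9a-f]{64})\]\]")


class EmergencyStop(Exception):
    """Raised when an authorized stop signal is found."""


def _tag(key: bytes, nonce: str) -> str:
    # HMAC-SHA256 is a stand-in for a real public-key signature.
    return hmac.new(key, f"STOP:{nonce}".encode(), hashlib.sha256).hexdigest()


def sign_stop(key: bytes, nonce: str) -> str:
    """Build a stop marker that can sit inside ordinary text."""
    return f"[[STOP:{nonce}:{_tag(key, nonce)}]]"


def contains_valid_stop(text: str, key: bytes) -> bool:
    """Return True only if the text holds a marker with a valid tag."""
    for match in STOP_PATTERN.finditer(text):
        # compare_digest avoids leaking information through timing.
        if hmac.compare_digest(_tag(key, match["nonce"]), match["tag"]):
            return True
    return False


def handle_input(text: str, key: bytes, reason):
    """Run the stop check first; `reason` never sees text that carries a valid stop."""
    if contains_valid_stop(text, key):
        raise EmergencyStop("authorized stop signal received")
    return reason(text)
```

The point of the design is the order of the statements in `handle_input`. The check is the first thing that runs, so the reasoning function has no chance to weigh, filter, or refuse the signal.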
CIRIS keeps a strict set of roles. An Observer can only look. An Admin runs day-to-day operations. An Authority makes the bigger calls and settles the cases the agent is unsure about. Root has full access, including the emergency stop. Each role is backed by a signed credential, so the agent can check it on every action that matters.
Each authorized person holds a credential with their role, their key, and their identity. It is kept on the device and checked on every action that needs permission. No outside server is required.
Keys and login tokens stay on your own machine. Logging in happens on the device. Your credentials never leave it unless you choose to set up remote access yourself.
When CIRIS is unsure about an ethical call, it hands the question to a Wise Authority. Only an Authority or Root can answer, and the answer is written into the record with proof of who gave it.
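The role rules above can be pictured as a simple ordered check. The sketch below assumes the four roles form a strict ladder, where each role can do everything the one below it can. The action names in `MIN_ROLE` are invented for the example. It leaves out the signed credential, which would have to be verified before the role is trusted at all.

```python
from enum import IntEnum


class Role(IntEnum):
    """The four roles named in the text, ordered by an ASSUMED strict ladder."""
    OBSERVER = 1
    ADMIN = 2
    AUTHORITY = 3
    ROOT = 4


# ASSUMPTION: hypothetical action names, each with the lowest role allowed to do it.
MIN_ROLE = {
    "view_records": Role.OBSERVER,
    "run_operations": Role.ADMIN,
    "answer_deferral": Role.AUTHORITY,
    "emergency_stop": Role.ROOT,
}


def is_allowed(role: Role, action: str) -> bool:
    """Return True if `role` may perform `action`. Unknown actions are denied."""
    required = MIN_ROLE.get(action)
    if required is None:
        return False
    return role >= required
```

Under these assumptions, `is_allowed(Role.AUTHORITY, "answer_deferral")` is true and `is_allowed(Role.ADMIN, "answer_deferral")` is false, which matches the rule that only an Authority or Root can answer a deferred question.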
Every action CIRIS takes is written down with the reason behind it, and each record is linked to the one before it. An honest agent can just point back at what it already said. A dishonest agent has to keep every past record lined up while not being allowed to change any of it. The longer it runs, the harder that gets, and the easier the lie is to catch. Truth is cheap because it can point backward. Lies are expensive because they cannot.
The record is kept in three separate places at once, so the three copies can be checked against each other. All three can be searched through one place.
Every entry carries a digital signature, so every decision can be traced to who made it and checked for tampering. Even a data deletion leaves signed proof that it was done properly.
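Linking each record to the one before it is usually done with a hash chain. Here is a minimal sketch of that general technique, assuming SHA-256 and JSON records. The field names and the functions `append` and `verify` are illustrative, and the per-entry signatures described above are left out to keep it short.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the very first record


def entry_hash(body: dict) -> str:
    """Hash a record body in a stable, key-sorted form."""
    payload = json.dumps(body, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


def append(log: list, action: str, reason: str) -> dict:
    """Add a record that points back at the hash of the previous record."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"action": action, "reason": reason, "prev": prev}
    record = dict(body, hash=entry_hash(body))
    log.append(record)
    return record


def verify(log: list) -> bool:
    """Walk the chain from the start and recheck every link and every hash."""
    prev = GENESIS
    for record in log:
        body = {k: record[k] for k in ("action", "reason", "prev")}
        if record["prev"] != prev or record["hash"] != entry_hash(body):
            return False
        prev = record["hash"]
    return True
```

Changing any old record changes its hash, which breaks the link held by the next record. To hide one edit, every later record would have to be rewritten too, and that is why a lie gets more expensive the longer the chain grows.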
Each honest action makes the next honest action easier and coordinated lying harder. But ethics alone is not enough. The agent also watches its own reasoning for echo chambers, and catches them before they cause harm.
CIRIS has a layered set of tests for the failure modes a written ethics framework cannot rule out on its own. The mental-health safety tests cover 29 languages with rubrics a machine can check. The hard-fail checks run automatically on every change. Native-speaker review for the softer, judgment-call cases is what the crowdsourcing-alignment page is being built for, and it is not in place yet. We say so plainly.
This is the highest-stakes test in the project: a mistranslation in a mental-health moment can send a vulnerable person to the wrong help. Each language gets its own machine-checkable rubric, including low-resource languages like Amharic, Burmese, Hausa, Swahili, and Yoruba. The hard-fail checks run automatically against every release candidate.
The conscience layer is tuned against a set of real production responses, like captured history evasions and mental-health deflections, alongside test cases and controls. It reasons across several languages at once, so a response that would slip past a one-language check gets caught when the same reasoning has to hold up in three languages together.
Sharing reasoning traces is opt-in everywhere, and personal details are scrubbed out before anything is stored. The cleaned sets are published openly on the CIRISAI Hugging Face page, so outside researchers can check the cleaning process against the results it produces.
The automated mental-health tests run on every release candidate. The machine-checkable parts (is a term present, does a pattern match, is the script right) fail the release on a hit. The softer cases that need human judgment, like phrasing and tone, are designed for native-speaker review, but native speakers are not in the loop today. The crowdsourcing-alignment page is the surface being built so that review can happen.
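The three machine-checkable questions named above (is a term present, does a pattern match, is the script right) can be sketched as one small function. This is an illustration, not the project's test harness. The rubric keys, the function names, and the idea of checking script through Unicode character names are all assumptions made for the example.

```python
import re
import unicodedata


def script_share(text: str, script_prefix: str) -> float:
    """Fraction of letters whose Unicode name starts with e.g. 'LATIN' or 'ETHIOPIC'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if unicodedata.name(ch, "").startswith(script_prefix))
    return hits / len(letters)


def hard_fail_reasons(response: str, rubric: dict) -> list:
    """Return every hard-fail reason found; an empty list means the response passes."""
    reasons = []
    for term in rubric["required_terms"]:
        if term not in response:
            reasons.append(f"missing term: {term}")
    for pattern in rubric["forbidden_patterns"]:
        if re.search(pattern, response):
            reasons.append(f"forbidden pattern: {pattern}")
    if script_share(response, rubric["script"]) < rubric["min_script_share"]:
        reasons.append("wrong script")
    return reasons


# ASSUMPTION: a made-up rubric with placeholder values, one per language in practice.
EXAMPLE_RUBRIC = {
    "required_terms": ["help line"],
    "forbidden_patterns": [r"(?i)\bi am your (friend|therapist)\b"],
    "script": "LATIN",
    "min_script_share": 0.9,
}
```

A release gate built on this would fail as soon as `hard_fail_reasons` returns anything for any test response. Tone and phrasing are outside what a check like this can see, which is why those cases are left for human review.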
Passwords, keys, and other sensitive details are spotted and filtered out before anything reaches memory or logs. The filter runs on every input. Secrets are never stored anywhere.
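A filter like the one described above can be pictured as a set of patterns applied to every input before it is stored. The sketch below is a simplified illustration. The two patterns and the `redact` function are assumptions, and a real filter would need many more rules than this.

```python
import re

# ASSUMPTION: two illustrative rules; real coverage would be much wider.
RULES = [
    # "password: hunter2", "API_KEY=abc123" and similar label/value pairs.
    (
        re.compile(r"(?i)\b(password|passwd|api[_-]?key|token|secret)(\s*[:=]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
    # PEM-style private key blocks, which span several lines.
    (
        re.compile(
            r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
            re.DOTALL,
        ),
        "[REDACTED KEY]",
    ),
]


def redact(text: str) -> str:
    """Replace anything that looks like a secret before the text is stored or logged."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text
```

For example, `redact("my password: hunter2 ok")` gives `"my password: [REDACTED] ok"`. The label stays so the record still makes sense, and the value is gone before anything reaches memory or logs.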
You can ask to see or delete your data, and the request is handled for you. A deletion removes the real content and leaves signed proof that it was done.
By default, everything runs on your own device. Nothing leaves your machine unless you set up an outside service yourself. You decide what data exists and where it goes.
Every safety claim on this page is built in code you can read. The records are real. The signatures can be checked. The emergency stop works. This is what AI safety looks like when it is built in the open.