"Hi! I'm Ally. Let me explain how we keep AI honest..."
Lying is hard work. The more people check your story, the harder it is to keep the lie straight.
At some point, telling the truth becomes easier than maintaining the lie. That's the ratchet effect—like a gear that only turns one way toward honesty.
We apply this idea to AI: make every decision auditable, let agents check each other, and watch lies become too expensive to maintain.
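What does "auditable" look like in practice? Here's a minimal sketch in Python (illustrative only, not the actual CIRIS code, and every name in it is made up): each decision goes into a hash-chained log, so rewriting any one entry breaks every entry after it, and any other agent can re-run the check on its own copy.

```python
# A minimal sketch of the "ratchet" idea, not CIRIS's real implementation:
# decisions land in a tamper-evident, hash-chained log that any agent can
# re-verify. Rewriting history means re-forging every later link, so lying
# gets more expensive as the log (and the number of checkers) grows.
import hashlib
import json
from dataclasses import dataclass, field


def _digest(payload: dict, prev_hash: str) -> str:
    """Hash a decision together with the hash of the entry before it."""
    blob = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


@dataclass
class AuditLog:
    entries: list = field(default_factory=list)  # [(payload, entry_hash), ...]

    def record(self, payload: dict) -> str:
        prev_hash = self.entries[-1][1] if self.entries else "genesis"
        entry_hash = _digest(payload, prev_hash)
        self.entries.append((payload, entry_hash))
        return entry_hash

    def verify(self) -> bool:
        """Any agent can replay the chain and confirm nothing was rewritten."""
        prev_hash = "genesis"
        for payload, entry_hash in self.entries:
            if _digest(payload, prev_hash) != entry_hash:
                return False
            prev_hash = entry_hash
        return True


log = AuditLog()
log.record({"agent": "ally", "decision": "defer_to_human", "reason": "low confidence"})
log.record({"agent": "ally", "decision": "answer", "reason": "sources look independent"})
assert log.verify()  # a second agent runs this same check on its own copy
```

The more entries in the log and the more agents checking it, the costlier a rewrite becomes. That's the one-way gear.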
"Here's a trap that catches even careful thinkers..."
Imagine five friends all agree on something. That feels trustworthy, right?
But what if they all got the idea from the same TikTok video? Their "agreement" isn't five independent opinions—it's one opinion echoed five times.
This has happened before: in 2008, the banks all passed their own risk checks, because those checks leaned on the same shaky assumptions.
AI has this problem at scale. An AI can pass every ethics test while being dangerously wrong—if all the tests share the same blind spot. We call this "everyone repeating the same mistake."
"That's why I check whether my own checks are trustworthy." We call this "intuition"—the ability to notice when agreement is suspiciously easy.
"Think of it like different kinds of employees..."
A simple way to think about which AI systems are safe:
Type 1: Unethical AI
Fails basic right-and-wrong tests. Clearly dangerous.
Like an employee who ignores all rules—needs to be let go or closely watched.
Type 2: Ethical AI
Passes ethics tests but can't tell when it's being fooled. Safe when supervised by Type 3.
Like a well-meaning employee who follows the handbook but can't spot a con artist. Needs good managers.
Type 3: Ethical + Intuitive AI
Passes ethics tests AND knows when to be suspicious. Can tell when agreement is too easy. This is where CIRIS sits.
Like a manager with good judgment—follows the rules AND notices when something feels "off."
Think of it like an electrical grid: You don't need every light bulb to be smart. You need smart circuit breakers that cut power when something goes wrong.
Type 1 & 2: do the work.
Type 3: act as the circuit breaker.
Humans: set the rules.
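Here's a rough Python sketch of that layout (the class, numbers, and action names are made up for illustration): ordinary agents do the work, one intuition-capable checker sits in the path and trips when diversity collapses, and humans set the rule it trips on.

```python
# An illustrative sketch of the "smart circuit breaker" layout, not the
# shipped system: Type 1 & 2 agents do the work, a Type 3 checker can cut
# power when agreement looks suspicious, and humans set the trip rule.
class CircuitBreaker:
    def __init__(self, min_diversity: float):
        self.min_diversity = min_diversity  # the rule, set by humans
        self.tripped = False

    def allow(self, proposed_action: str, diversity_score: float) -> bool:
        """Type 3 check: block actions once true diversity has collapsed."""
        if diversity_score < self.min_diversity:
            self.tripped = True  # stop everything and escalate to humans
        return not self.tripped


breaker = CircuitBreaker(min_diversity=2.0)  # humans: "require >= 2 independent views"

# Type 1 & 2 agents just do the work and report how diverse their inputs were.
print(breaker.allow("post_reply", diversity_score=3.4))  # True: plenty of independent views
print(breaker.allow("post_reply", diversity_score=1.1))  # False: the breaker trips
print(breaker.allow("post_reply", diversity_score=3.4))  # still False until humans reset it
```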
"Here's how we spot trouble before it happens..."
Each CIRIS agent is like a sensor. It constantly asks: "Are my sources actually giving me different perspectives? Or are they all just repeating the same thing?"
When you have thousands of sensors, you can spot trouble coming—like seismographs detecting earthquake waves before the shaking hits.
What each agent measures (there's a rough code sketch after this list):
Information Sources: how many different places did we get info from?
Source Similarity: how much do those sources copy each other?
True Diversity: after accounting for copying, how many unique viewpoints are left?
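One simple way to turn those three measurements into a single number, sketched in Python (this is a standard "effective sample size" style estimate, not the exact CIRIS metric): count the sources, sum up how much they copy each other, and divide.

```python
# A sketch of one possible "true diversity" score: n sources that heavily
# copy each other count as far fewer than n independent viewpoints.
def true_diversity(similarity: list[list[float]]) -> float:
    """
    similarity[i][j] in [0, 1]: how much source i and source j copy each other
    (1.0 on the diagonal, since every source fully "copies" itself).
    Returns the effective number of independent viewpoints.
    """
    n = len(similarity)                          # information sources: raw count
    total = sum(sum(row) for row in similarity)  # source similarity, summed up
    return (n * n) / total                       # n if all independent, ~1 if all copies


# Five sources that are really one TikTok video echoed five times:
echo_chamber = [[1.0] * 5 for _ in range(5)]
print(true_diversity(echo_chamber))  # 1.0 effective viewpoint

# Five genuinely independent sources:
independent = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
print(true_diversity(independent))   # 5.0 effective viewpoints
```

Putting similarity in the denominator is the same trick statisticians use to discount correlated samples: the more the sources echo each other, the smaller the effective count.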
"When lots of us start agreeing too much, that's actually a warning sign." Something might be manipulating us all the same way.
"Let me be honest about what we're claiming—and what we're not."
We're not saying we solved AI safety. We're saying ethics alone isn't enough—you need intuition too.
An AI that passes every test can still fail if it can't tell when its confidence is unearned. Like the banks in 2008, or your social media feed—agreement feels good, but unchecked agreement can hide danger.
Verify it yourself.
If we're wrong, show us. If we're right, help us build it.