The 99.9% Threshold: Architecting Hybrid Moderation for KRKB
March 12, 2026 • Building the "Swiss Cheese" model of AI safety for children.
Building a social platform for children (like our book review site, KRKB.org) exposes a brutal engineering paradox: your users are uniquely vulnerable, yet they possess a chaotic linguistic creativity that breaks off-the-shelf moderation tools instantly.
We collided with this reality hard during our Beta launch. Drop an off-the-shelf "toxicity filter" trained on Reddit into a kids' forum, and you get chaos. We watched the AI confidently flag the sentence "This book is dope!" as a severe drug reference, while labeling basic phrasing like "I hate this character so much" as targeted harassment. Meanwhile, across the digital hall, a resourceful 9-year-old bypassed our supposedly foolproof enterprise filtering system using raw leetspeak and three emojis, in under four seconds. To protect our users without muting their natural enthusiasm, we needed something far beyond a basic list of banned words.
Embracing the "Swiss Cheese" Defense
The core lesson we internalized? You absolutely cannot rely on a single LLM to make blanket safety decisions. LLMs hallucinate; their judgment fails randomly. Instead, we architected a dense cascade of deterministic and probabilistic layers. Yes, each individual filter has its own gaping holes, but by carefully stacking them atop one another, we practically eliminated any direct path for harmful content to sneak through.
By the time a piece of flagged content actually reaches a human moderator's dashboard, it has already survived an absolute gauntlet. We don't want humans reading obvious spam, and we certainly don't want to pay OpenAI fifty cents to catch a misspelled swear word. We engineered the pipeline to brutally shed volume at the cheapest layers first (a minimal code sketch of the full cascade follows the list):
- Layer 1 (0ms overhead): Bare-metal regex. It instantly snipes phone numbers, messy email drops, and brute-force profanity. This knocks out roughly 15% of the garbage immediately.
- Layer 2 (50ms overhead): A lightning-fast, locally hosted BERT classifier. It doesn't care about deep meaning; it just looks for toxic shapes and total gibberish strings, cutting the remaining noise by another 40%.
- Layer 3 (400ms overhead): This is where the heavy guns come out. We route the stubborn, highly ambiguous phrases to a massive semantic LLM (like Gemini 2.5 Pro) strictly to parse the intent behind the words.
- Layer 4 (Asynchronous): The Human Queue. Only the absolute trickiest 10% of "gray area" interactions ever make it here. The result? Our small moderation team works efficiently, completely shielded from baseline internet toxicity.
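Here is a minimal sketch of how a cascade like this can be wired together. The layer functions are placeholders (the real BERT and LLM calls live behind separate services), and the regexes, blocklist, and thresholds are illustrative rather than our production values:

```python
import re

# Layer 1: deterministic regex for phone numbers, email drops, and brute-force profanity.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
BANNED_WORDS = {"badword1", "badword2"}  # stand-in for the real blocklist


def layer1_regex_block(text: str) -> bool:
    """True if the zero-cost deterministic layer blocks the post outright."""
    if PHONE_RE.search(text) or EMAIL_RE.search(text):
        return True
    lowered = text.lower()
    return any(word in lowered for word in BANNED_WORDS)


def layer2_bert_score(text: str) -> float:
    """Placeholder for the locally hosted BERT toxicity classifier (returns 0.0-1.0)."""
    return 0.0  # real system: forward pass against a local model server


def layer3_llm_verdict(text: str) -> str:
    """Placeholder for the semantic LLM intent check: 'safe', 'harmful', or 'unsure'."""
    return "unsure"  # real system: prompt a large model and parse its structured answer


def moderate(text: str) -> str:
    # Layer 1 (~0ms): shed the cheap, obvious garbage first.
    if layer1_regex_block(text):
        return "blocked"

    # Layer 2 (~50ms): toxic "shapes" and gibberish; thresholds here are illustrative.
    score = layer2_bert_score(text)
    if score < 0.2:
        return "published"
    if score > 0.9:
        return "blocked"

    # Layer 3 (~400ms): only the ambiguous middle band pays for the expensive semantic pass.
    verdict = layer3_llm_verdict(text)
    if verdict == "safe":
        return "published"
    if verdict == "harmful":
        return "blocked"

    # Layer 4 (async): default-flag. Hold publication and escalate to the human queue.
    return "human_review"
```

Each layer only ever sees what the cheaper layers ahead of it could not confidently decide, which is what keeps the expensive calls (and the humans) off the hot path.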
Guilty Until Proven Innocent
If you upload a borderline video to YouTube or Instagram, their algorithms are economically incentivized to keep it live. Major consumer networks operate on a "Default Allow / Remediate Later" philosophy because every view generates ad revenue. For KRKB, our incentives run completely in reverse. A single catastrophic failure in a children's environment shatters trust permanently.
Therefore, our entire hybrid system operates on a brutally strict "Default Flag" architecture. When Gemini 2.5 Pro encounters slang it hasn't mapped, or a sentence structure that dances aggressively on the vector boundary, it does not give the user the benefit of the doubt. It silently pauses publication and escalates to the Human Queue. Yes, this requires us to employ an actual human moderation team to clear the backlog—an expense most startups actively avoid—but it ensures that our playground remains immaculate.
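The difference between the two philosophies fits in a few lines. A hedged sketch, reusing the illustrative verdict values from the cascade above:

```python
def default_allow(verdict: str) -> str:
    """How an ad-funded network treats ambiguity: publish now, remediate later."""
    return "blocked" if verdict == "harmful" else "published"


def default_flag(verdict: str) -> str:
    """KRKB's rule: anything the models cannot positively clear goes to the human queue."""
    if verdict == "safe":
        return "published"
    if verdict == "harmful":
        return "blocked"
    return "human_review"  # ambiguous: silently pause publication and escalate


print(default_allow("unsure"))  # published
print(default_flag("unsure"))   # human_review
```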
Dodging the Over-Censorship Trap
If you wrongly delete a mundane post in a massive adult forum like Reddit, you annoy a random user who might just post it again. But if an AI silently deletes a sprawling, passionate 300-word essay a child just typed about their favorite Harry Potter book simply because the bot misread an adjective? You crush their creative spirit completely. They don't try again. They just leave. We flatly refuse to be part of that problem.
Our engineering team tracks the "Over-Censorship Rate" with obsessive, almost paranoid focus. In the early days, our safety bots would repeatedly hallucinate "real-world violent threats" during completely innocuous discussions of epic fantasy battles. A kid writing about a wizard casting a fireball was being flagged alongside actual cyberbullying. This forced us to fine-tune a custom classifier designed strictly to tell the difference between rich, aggressive literary descriptions of swords clashing and genuine abusive intent directed at another human being.
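For illustration, one reasonable way to compute an over-censorship rate (our exact definition and dashboards aren't shown here) is the share of bot-flagged posts that human reviewers later clear as harmless:

```python
from dataclasses import dataclass


@dataclass
class ReviewedPost:
    bot_flagged: bool     # did the automated cascade flag or block it?
    human_harmful: bool   # did a human moderator confirm actual harm?


def over_censorship_rate(posts: list[ReviewedPost]) -> float:
    """Fraction of bot-flagged posts that humans judged harmless."""
    flagged = [p for p in posts if p.bot_flagged]
    if not flagged:
        return 0.0
    false_positives = sum(1 for p in flagged if not p.human_harmful)
    return false_positives / len(flagged)


sample = [
    ReviewedPost(bot_flagged=True, human_harmful=False),   # wizard casting a fireball
    ReviewedPost(bot_flagged=True, human_harmful=True),    # genuine cyberbullying
    ReviewedPost(bot_flagged=True, human_harmful=False),   # dramatic battle scene
    ReviewedPost(bot_flagged=False, human_harmful=False),
]
print(over_censorship_rate(sample))  # 0.666... -> two of three flags were wrong
```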
Picture every flagged post plotted on two axes, raw semantic toxicity against surrounding narrative context. That graph captures the exact classification problem our fine-tuned classifier solves hundreds of times a minute (a toy sketch of the boundary follows the list):
- The Danger Zone (Bottom Right): These are unambiguous, targeted threats directed at fellow readers. They score incredibly high for raw semantic toxicity and lack any surrounding narrative context to justify the language. These are instantly eliminated.
- The Fantasy Zone (Top Middle): This is the cluster containing epic battles, fiery dragons, and dramatic plot summaries. They use "violent" phrasing, but are contextually safe. Standard APIs hopelessly blur the lines between these two zones.
- The Insight: We realized we couldn't rely on generic LLMs to draw a straight line between the two. By engineering a much sharper, custom vector boundary, we successfully preserve a child's right to free, dramatic expression while keeping organic harm locked strictly out of the playground.
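To make the idea concrete, here is a toy version of that boundary: posts scored on two illustrative features, raw toxicity and narrative context, separated by a hand-written linear rule. The feature names and weights are invented for this example; the actual classifier is learned from labeled data, not hand-tuned:

```python
def harm_score(toxicity: float, narrative_context: float) -> float:
    """Illustrative linear boundary: violent wording only counts against a post
    when it is not wrapped in narrative context (plot summaries, battle scenes)."""
    # Weights are invented for this example; a real boundary is learned, not hand-tuned.
    return 2.5 * toxicity - 3.0 * narrative_context - 0.5


def classify(toxicity: float, narrative_context: float) -> str:
    return "danger_zone" if harm_score(toxicity, narrative_context) > 0 else "fantasy_or_safe"


# Targeted threat at another reader: high toxicity, no story around it.
print(classify(toxicity=0.9, narrative_context=0.05))  # danger_zone
# "The dragon burned the castle to ash": violent words, rich narrative context.
print(classify(toxicity=0.8, narrative_context=0.7))   # fantasy_or_safe
```

The point of the toy is the shape of the rule, not the numbers: context actively pushes a post back toward "safe" instead of being ignored the way a single generic toxicity score ignores it.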
Safety is Not an Add-on
When you build for kids under 13, safety is not a roadmap item you prioritize for Q3. It is the product. Parents don't trust platforms; they trust proof. We engineered our entire hybrid cascade to be aggressively paranoid yet contextually forgiving. We aren't just filtering bad words—we're building an invisible fence that protects the playground without shrinking it.