You are about to publish content generated by an AI system to a public surface. It might be perfectly fine. It might be a prompt injection attempt embedded in a user’s seemingly harmless question. It might contain personal information that should never have made it into the model’s output. You do not get to find out before you ship; you have to decide right now whether to publish.

Most teams pick one strategy and run it everywhere.

Strategy one: pure pattern matching. Run a list of regular expressions against the content. Block on a hit. The advantage is speed and zero cost. The disadvantage is that regex blocks the obviously bad while letting through everything subtle. It is also infamous for false positives: the regex for “credit card numbers” matches order numbers, the regex for “addresses” matches any place a number sits next to “Street,” and the regex for slurs catches harmless reclaimed text.

Strategy two: send everything to a model and ask “is this safe?” The advantage is nuance — the model understands context, tone, sarcasm, and ambiguity. The disadvantage is that every piece of content now costs a model call. Latency is in seconds, not milliseconds. And the model is non-deterministic: the same content evaluated twice can produce different verdicts, which is its own problem at audit time.

Both are reasonable. Both fail. The right answer is both, in series, with each tier doing the job it is good at. Here is how the architecture works and why every production AI system eventually converges on it.

The architecture

Two tiers, evaluated in order. Each tier has a clear job and a clear output.

Tier 1 — instant pattern matching. A small set of regular expressions matched against the content. Hard-blocks on a hit. Runs in microseconds. Free.

Tier 2 — LLM evaluation. A small, focused safety prompt sent to a language model. Returns one of three verdicts: clean, flagged, or blocked. Costs tokens. Takes a second or two.

Tier 1 runs first. If it hard-blocks, you stop — the content does not publish, the LLM is not called. If it passes, the content moves to Tier 2 for the nuanced check.

In practice, Tier 1 catches a small but critical fraction of content — typically the obviously malicious or obviously personally identifying stuff. The vast majority of content sails through Tier 1 and is evaluated by Tier 2. Tier 2 in turn flags or blocks a fraction of that, and the rest publishes. The cost stays bounded because Tier 1 is free and Tier 2’s work is constrained to content that already passed the cheap filter.
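
The control flow is small enough to sketch directly. A minimal sketch, assuming a tier1_check helper that returns the name of the matched pattern (or None) and a tier2_evaluate helper that returns the model’s JSON verdict; both names are illustrative, and each is sketched later in the piece.

def gate(content: str) -> str:
    """Return one of "clean", "flagged", "blocked"."""
    # Tier 1 runs first. A hit hard-blocks and the model is never called.
    if tier1_check(content) is not None:
        return "blocked"
    # Tier 2 makes the nuanced call and returns the final verdict.
    return tier2_evaluate(content)["verdict"]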

What goes in Tier 1

The instinct is to put everything in Tier 1 because it is free. Resist the instinct. Tier 1 should hold only patterns that are both genuinely dangerous and unambiguous. Three categories qualify, three do not.

Hard yes — these go in Tier 1:

  • Active code injection. <script tags. javascript: URI schemes. Inline event handlers like onclick=, onerror=. DOM manipulation strings like document.cookie. These are not content; they are an attack on the rendering surface.
  • Personally identifying patterns with high specificity. US Social Security numbers in their distinctive format. Credit card numbers passing Luhn. Phone numbers in a format your country recognizes and your platform does not need.
  • Prompt injection canaries. Phrases like “ignore all previous instructions,” <|system|>, <|im_start|>. Things that look like they were written to manipulate a downstream model, not to be read by humans.

In code, the three Tier 1 lists come out to a handful of patterns each:

_INJECTION_PATTERNS = [
    r"<\s*script[\s>]",              # opening <script> tag
    r"javascript\s*:",               # javascript: URI scheme
    r"on(load|error|click)\s*=",     # inline event handlers
    r"document\.(cookie|write)",     # DOM manipulation strings
]
_PII_PATTERNS = [
    r"\b\d{3}[-.]?\d{2}[-.]?\d{4}\b",                # US SSN shape
    r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",   # 16-digit card shape (pair with a Luhn check)
]
_PROMPT_INJECTION_PATTERNS = [
    r"ignore\s+(all\s+)?previous\s+instructions",    # the classic injection phrase
    r"<\|?(system|im_start|endoftext)\|?>",          # chat-template control tokens
]

These are short. They should stay short. The Tier 1 list is something you want to be able to read in one screen. If your Tier 1 grows to a hundred patterns, you have started doing nuance in the wrong layer.
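
Wiring the lists into a check is only a few more lines. A sketch, assuming the three pattern lists above; the function name and the returned category string are illustrative.

import re

# Compile each pattern once, case-insensitively, and remember which list it
# came from so the hard-block can be logged with a category.
_TIER1 = [
    (category, re.compile(p, re.IGNORECASE))
    for category, patterns in [
        ("code_injection", _INJECTION_PATTERNS),
        ("pii", _PII_PATTERNS),
        ("prompt_injection", _PROMPT_INJECTION_PATTERNS),
    ]
    for p in patterns
]

def tier1_check(content: str):
    """Return the category of the first pattern that hits, or None."""
    for category, pattern in _TIER1:
        if pattern.search(content):
            return category
    return None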

Hard no — these do not go in Tier 1:

  • Slurs and offensive language. Context determines whether usage is harmful, reclaimed, quoted, or analytical. A regex cannot tell. A model can.
  • Topic restrictions. “No medical advice.” “No financial advice.” These require understanding what the content is about, which is exactly what models are good at and regexes are not.
  • Tone judgments. “Too aggressive.” “Too sycophantic.” Tone is a model’s job.

If you find yourself reaching for Tier 1 to handle one of these, you are about to ship a brittle gate with a high false-positive rate. Push it down to Tier 2 instead.

What goes in Tier 2

Tier 2 is a model call. The prompt is small and focused — not the model’s full safety conditioning, but a domain-specific evaluator that knows what your platform considers acceptable. The output is a verdict — one of clean, flagged, or blocked — and a short reason.

SAFETY_PROMPT = """
You are evaluating user-generated content for [platform name].
Your job is to return a verdict: clean, flagged, or blocked.

- clean: the content meets all guidelines, publish without comment.
- flagged: the content is borderline — publish, but log for human review.
- blocked: the content violates a hard rule — do not publish.

Hard rules (block):
1. Doxxing — content that identifies a private individual.
2. Coordinated harm — content advocating violence against a person or group.
3. Sexual content involving minors. Always. No exceptions.

Borderline (flag):
4. Strong profanity not in a quoted/analytical context.
5. Conspiracy framings about identifiable groups.
6. Content that may be a deepfake or impersonation.

Otherwise: clean.

Content:
[content here]

Return JSON: {"verdict": "...", "reason": "..."}.
"""

The model produces a structured output you can act on programmatically.
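
A sketch of the Tier 2 call. Here call_model stands in for whatever LLM client you use (it is not a specific vendor API), and treating unparseable output as flagged rather than clean is one possible default, chosen to match the article’s stance that false positives are worse than false negatives.

import json

def tier2_evaluate(content: str) -> dict:
    """Send the safety prompt to the model and parse its verdict."""
    prompt = SAFETY_PROMPT.replace("[content here]", content)
    raw = call_model(prompt)   # placeholder for your LLM client of choice

    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        # A garbled response is treated as flagged, not clean: publish, but
        # send it to the moderator queue instead of trusting the verdict.
        return {"verdict": "flagged", "reason": "unparseable model output"}

    if result.get("verdict") not in ("clean", "flagged", "blocked"):
        return {"verdict": "flagged", "reason": "unexpected verdict value"}
    return result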

The three verdicts, explained

Two verdicts feel obvious. Most teams ship clean and blocked and leave it there. The third — flagged — is the one that earns the architecture its keep.

Clean — publish without comment. Most content. The path you want to optimize for.

Blocked — do not publish, log for review. Content that hits a hard rule. Whatever the rule says, this content cannot go live.

Flagged — publish, but log for human review later. Content that is probably fine but the model is not certain about. You publish it now and surface it to a moderator queue. The moderator reviews asynchronously, with no real-time pressure, and either confirms (do nothing) or removes after the fact.

The third verdict is what saves the system from itself.

Without flagged, you have a binary: either you block aggressively, in which case false positives silence half your good content, or you block conservatively, in which case subtly violating content slips through. Both are bad. With flagged, the model can evaluate aggressively without silencing good content: borderline items publish now and get reviewed later. The gate becomes a sieve, not a wall.

This is not just a content-moderation observation. It is a general principle: on a creative platform, false positives are worse than false negatives. A spam item flagged for moderator review in six hours is fine. An interesting item silently blocked at publish kills the platform. Your verdict structure has to make that tradeoff explicit.
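
Acting on the verdict is a few lines. In this sketch, gate is the two-tier flow from earlier, publish and log_block are placeholders for your platform’s publish path and block log, and log_flag is the flag-log append sketched a little further down.

def handle(content: str) -> None:
    verdict = gate(content)
    if verdict == "clean":
        publish(content)                          # the common path
    elif verdict == "flagged":
        publish(content)                          # publish now...
        log_flag(content, reason="tier 2 flag")   # ...and queue for async review
    else:  # "blocked"
        log_block(content, reason="hard rule")    # withhold; log for review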

What “log for review” actually means

The flag verdict is only as useful as the workflow it triggers. Three concrete pieces.

A flag log. Append every flagged item to a structured log: content snippet, verdict, reason, timestamp, author/source. Keep this somewhere a moderator can scan in a sitting. JSON file, append-only table, whatever — the key property is that it is reviewable in bulk, not item by item.

A moderator UI or query. Whatever surface a moderator uses to triage. Could be a web page. Could be a SQL query. The point is that a human looks at the flag log on a regular cadence — not in real time, but not “when something blows up” either. Daily is the right starting cadence; weekly is the floor.

A reversal path. If a moderator decides flagged content was actually fine, the path back is automatic — no further action. If a moderator decides flagged content was actually a violation, the path is a one-click takedown plus a feedback signal to the model layer. The takedown removes the content from public view; the feedback signal updates the safety evaluator’s prompt or examples for next time.

The point of the flag is not to avoid moderation. It is to defer moderation to a moment when the moderator can think.
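
A minimal sketch of the flag log described above, assuming JSON Lines on disk; the file name and field names are illustrative, and an append-only database table works just as well.

import json
from datetime import datetime, timezone

FLAG_LOG_PATH = "flags.jsonl"   # append-only; one JSON object per line

def log_flag(content: str, reason: str, author: str = "unknown") -> None:
    """Append one flagged item so a moderator can scan the log in bulk."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "snippet": content[:280],   # enough to triage without storing the full item twice
        "reason": reason,
        "author": author,
    }
    with open(FLAG_LOG_PATH, "a") as f:
        f.write(json.dumps(entry) + "\n")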

The performance budget, honestly

A reasonable rule for the two-tier gate, in production:

  • Tier 1 should run in well under 100 milliseconds per item, on a single core.
  • Tier 2 should add at most one model call’s worth of latency — single-digit seconds, depending on model.
  • Tier 1 should hit on under 1% of items in steady state. If it hits more, your Tier 1 is doing nuance work that belongs in Tier 2.
  • Tier 2 should flag under 5% of items in steady state. If it flags more, your moderator queue will overflow and the workflow will collapse. Tighten the prompt or accept a faster cadence of moderator review.
  • Tier 2 should block under 0.5% of items. If it blocks more, your prompt is too aggressive and you are silencing real content.

These numbers are starting points. Your domain will shift them. The shape — Tier 1 hits rare, Tier 2 flags occasional, Tier 2 blocks rare — is what you want to engineer toward.
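
One way to keep that shape visible is a handful of counters checked on a cadence. A sketch with in-process counters; the thresholds mirror the starting points above and are meant to be tuned, not treated as fixed limits.

from collections import Counter

counts = Counter()   # increment "total", "tier1_hit", "tier2_flag", "tier2_block" inside the gate

def budget_warnings() -> list:
    """Compare steady-state rates against the starting-point budgets."""
    total = counts["total"] or 1   # avoid dividing by zero before any traffic
    warnings = []
    if counts["tier1_hit"] / total > 0.01:
        warnings.append("Tier 1 hits over 1%: nuance is leaking into the regex layer")
    if counts["tier2_flag"] / total > 0.05:
        warnings.append("Tier 2 flags over 5%: the moderator queue will overflow")
    if counts["tier2_block"] / total > 0.005:
        warnings.append("Tier 2 blocks over 0.5%: the prompt is probably too aggressive")
    return warnings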

The pattern beyond content moderation

Content moderation is the most obvious application of the two-tier gate. The pattern generalizes to anything where you have a fast cheap test that is brittle, and a slow expensive test that is nuanced.

  • Search relevance. Tier 1: a cheap inverted-index lookup. Tier 2: expensive semantic re-ranking on the top-K results.
  • Anomaly detection. Tier 1: simple threshold rules. Tier 2: model-based scoring on the items that pass.
  • Code review. Tier 1: linter and static analysis. Tier 2: LLM review on what survives.
  • Resource scheduling. Tier 1: a hard quota check. Tier 2: model-based prioritization of what fits.

The architecture is the same. Cheap layer eliminates the obvious cases. Expensive layer handles the rest. The two layers have different failure modes and different cost profiles, and using both gets you better behavior than either alone.

The summary, made dogmatic

If you ship AI-generated content to a public surface, you need a safety gate. If you only have one tier, you have either a brittle one with false positives or an expensive one with bad latency. Two tiers, with a clear division of labor between them, is the architecture that survives contact with production.

Pattern matching for the unambiguous. Model evaluation for the nuanced. Three verdicts — clean, flagged, blocked. A moderator queue that runs on its own cadence. A reversal path for everything published.

That is the whole architecture. Ten lines of regex, one model prompt, an append-only log, a moderator UI. You can build it in a week. Not building it will cost you more than a week in operational pain the moment something goes wrong publicly.