
How It Works

Every moderation request enters at Tier 0. If that tier can make a confident decision, it returns immediately. If not, the request escalates to the next tier. Most requests never leave Tier 0.

Request ──▶ Tier 0 (Rust/WASM) ──── 60-70% resolved ──▶ Response
                │ uncertain                        ▲
                ▼                                  │
            Tier 1 (AI Classification) ── 20-30% ──┤
                │ ambiguous                        │
                ▼                                  │
            Tier 2 (Claude Haiku) ─────── 5-10% ───┘

Local Rust/WASM — Sub-5ms, $0/request. Handles clear-cut cases with pattern matching and term lists.

AI Classification — 100-200ms. Standard AI moderation for content that needs semantic understanding.

Claude Haiku — 300-500ms. Advanced analysis for ambiguous content, custom categories, and edge cases.


Tier 0 is a Rust pipeline compiled to WASM, running on Cloudflare’s edge network. It handles the majority of requests with zero external API calls.

  1. Unicode normalization — Converts homoglyphs and confusable characters to their base forms. ℌ𝔢𝔩𝔩𝔬 becomes Hello. Catches attempts to bypass filters with special Unicode characters.

  2. Leetspeak decoding — Translates common substitution patterns. h4ck3r, ph1sh1ng, a$$ are decoded before matching.

  3. Aho-Corasick pattern matching — A multi-pattern string matching algorithm that checks content against categorized term lists in a single pass. Faster than checking thousands of individual regexes.

  4. Custom rule evaluation — Your account-specific blocklists, allowlists, and regex patterns are applied here.

  5. Confidence scoring — Each match produces a confidence score. If the aggregate confidence is high enough (above the escalation threshold), Tier 0 returns a result. If not, the request escalates.
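The five stages above can be sketched roughly as follows. This is an illustrative Python approximation, not Sieve's Rust implementation: the leet map, term list, and thresholds are made-up stand-ins, and the single-pass dictionary scan stands in for a compiled Aho-Corasick automaton.

```python
import unicodedata

# Illustrative data only -- not Sieve's actual substitution map or term lists.
LEET_MAP = str.maketrans("013457$@", "oieastsa")
TERMS = {                       # term -> (category, confidence)
    "phishing": ("spam", 0.9),
    "scam": ("spam", 0.8),
    "noob": ("toxicity", 0.4),  # weak signal: lands in the uncertain band
}

def normalize(text: str) -> str:
    # NFKC folds homoglyphs and confusables to base forms: "ℌ𝔢𝔩𝔩𝔬" -> "Hello"
    return unicodedata.normalize("NFKC", text).lower()

def decode_leet(text: str) -> str:
    # "h4ck3r" -> "hacker", "ph1sh1ng" -> "phishing"
    return text.translate(LEET_MAP)

def match_terms(text: str):
    # Stand-in for the Aho-Corasick pass: a real pipeline compiles
    # thousands of terms into one automaton and scans in a single pass.
    return [TERMS[t] for t in TERMS if t in text]

def tier0(text: str, safe_below: float = 0.2, toxic_above: float = 0.7):
    cleaned = decode_leet(normalize(text))
    score = max((conf for _, conf in match_terms(cleaned)), default=0.0)
    if score <= safe_below:
        return {"resolved": True, "action": "allow"}
    if score >= toxic_above:
        return {"resolved": True, "action": "flag", "score": score}
    return {"resolved": False}  # uncertain band: escalate to Tier 1
```

Clearly safe and clearly toxic content both resolve in Tier 0; only scores in the uncertain band between the two thresholds escalate.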

Tier 0 passes the request to Tier 1 when:

  • Content has no strong pattern matches but contains suspicious signals
  • Confidence score falls between the “clearly safe” and “clearly toxic” thresholds
  • Content uses heavy obfuscation that pattern matching can’t resolve
  • The request includes categories that require semantic understanding

Tier 1 uses AI models optimized for content classification. The specific model depends on your pipeline mode:

Mode     Model                                          Why
general  OpenAI omni-moderation                         Free, fast, broad category coverage
gaming   OpenAI omni-moderation + gaming-tuned prompts  Adjusted thresholds for competitive context
edge     skipped                                        Edge mode is Tier 0 only
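The mode-to-model mapping amounts to a small dispatch table. A sketch, where the config shape and the use of None as a "skip Tier 1" marker are assumptions of this example:

```python
# Illustrative dispatch table -- not Sieve's internal configuration format.
TIER1_CONFIG = {
    "general": {"model": "omni-moderation", "prompt": "default"},
    "gaming":  {"model": "omni-moderation", "prompt": "gaming-tuned"},
    "edge":    None,  # edge mode is Tier 0 only
}

def tier1_for_mode(mode: str):
    if mode not in TIER1_CONFIG:
        raise KeyError(f"unknown pipeline mode: {mode}")
    return TIER1_CONFIG[mode]  # None means the request never reaches Tier 1
```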

Tier 1 handles content that needs semantic understanding — sarcasm, implied threats, coded language, and context-dependent meaning.

  • Implied threats — “I know where your school is” (no explicit violent language)
  • Coded hate speech — Dog whistles and evolving slang that pattern matching hasn’t indexed
  • Sarcasm and intent — “Oh sure, everyone should just die” (venting vs. actual threat)
  • Context-dependent toxicity — Content that’s fine in one context but not another

Tier 2 activates for genuinely ambiguous content — the 5-10% of requests where standard AI classification isn’t confident enough. It uses Claude Haiku for deeper analysis, and is triggered when:

  • Tier 1 returns a confidence score in the “uncertain” range
  • Content involves custom categories you’ve defined (e.g., “game exploit discussion”, “spam recruitment”)
  • Content requires cross-field analysis (checking username + message content together)
  • Appeal reviews — when a user disputes a moderation decision

Tier 2 doesn’t just classify — it reasons. It considers:

  • The full context of the content (who said it, where, in response to what)
  • Your custom category definitions and examples
  • Cultural and community-specific norms you’ve configured
  • Whether the content is genuinely ambiguous or a false positive from earlier tiers
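The inputs Tier 2 reasons over can be pictured as a single assembled context. A hypothetical sketch of that assembly — the field names and prompt wording here are illustrative, not Sieve's internal format:

```python
# Hypothetical prompt assembly for Tier 2; not Sieve's actual template.
def build_tier2_prompt(content, metadata, custom_categories, norms):
    lines = [
        f"Message: {content}",
        f"Author: {metadata.get('username', 'unknown')}",
        f"In reply to: {metadata.get('reply_to', 'n/a')}",
    ]
    for name, definition in custom_categories.items():
        lines.append(f"Custom category '{name}': {definition}")
    if norms:
        lines.append(f"Community norms: {norms}")
    lines.append(
        "Decide whether this is a genuine violation, ambiguous, "
        "or a false positive from an earlier tier."
    )
    return "\n".join(lines)
```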

Every moderation result includes per-category scores between 0.0 (safe) and 1.0 (clear violation).

Category     What it covers
toxicity     General toxic language, insults, hostility
harassment   Targeted abuse, bullying, personal attacks
threat       Threats of violence, physical harm, doxxing
profanity    Explicit language, slurs, vulgar content
sexual       Sexual content, solicitation, explicit material
self_harm    Self-harm, suicide, eating disorders
hate_speech  Hate speech targeting protected characteristics
spam         Spam, scams, phishing, unsolicited promotion

Each category has a configurable threshold. When a score exceeds the threshold, that category is flagged. The overall action is determined by the highest-severity flag:

score < threshold → allow
score >= threshold (any) → flag (send to human review queue)
score >= block_threshold → block (auto-reject)

You configure both threshold (flag) and block_threshold (auto-reject) per category in your dashboard or via the config API.
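The flag/block decision above can be sketched as follows. The threshold values here are illustrative defaults, not Sieve's shipped configuration:

```python
# Example per-category thresholds -- illustrative values only.
THRESHOLDS = {
    "toxicity": {"threshold": 0.7, "block_threshold": 0.9},
    "threat":   {"threshold": 0.5, "block_threshold": 0.8},
}

def decide(scores: dict) -> str:
    action = "allow"
    for category, score in scores.items():
        limits = THRESHOLDS.get(category)
        if limits is None:
            continue                    # unconfigured category: ignore
        if score >= limits["block_threshold"]:
            return "block"              # highest severity wins outright
        if score >= limits["threshold"]:
            action = "flag"             # queue for human review
    return action
```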


The context field tells Sieve what kind of content it’s analyzing. This changes scoring behavior significantly.

Real-time gaming communication. Higher tolerance for competitive language, trash talk, and gaming slang. “Get rekt”, “you’re trash”, “ez clap” score low on toxicity. Actual threats and hate speech are still caught.

{ "context": "gaming_chat" }

User-chosen display names. Stricter scoring — a toxic username is a persistent statement, not a momentary outburst. Checks for leetspeak, Unicode tricks, and hidden slurs.

{ "context": "username" }

Long-form forum content. Considers the full body of text, weighs isolated phrases against overall tone, and handles quoted content appropriately.

{ "context": "forum_post" }

Short comments and replies. Default context. Balanced scoring without domain-specific adjustments.

{ "context": "comment" }

Shadow mode lets you run Sieve alongside your existing moderation system without affecting production. Every request is processed by both your current system and Sieve, and you can compare results in the dashboard.

  1. Send requests to Sieve with "shadow": true in the request body
  2. Sieve processes the content and logs the result, but always returns "action": "allow"
  3. Your existing moderation system makes the actual decision
  4. Compare agreement rates in the dashboard to build confidence before switching
{
  "content": "message to moderate",
  "context": "gaming_chat",
  "shadow": true
}
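Step 4's agreement comparison happens in the dashboard, but a minimal client-side approximation looks like this, assuming you log (your_action, sieve_logged_action) pairs yourself:

```python
# Minimal agreement-rate sketch over logged decision pairs.
def agreement_rate(pairs) -> float:
    if not pairs:
        return 0.0
    agreed = sum(1 for ours, sieve in pairs if ours == sieve)
    return agreed / len(pairs)
```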

Sieve uses shadow mode internally to continuously validate Tier 0 accuracy:

Phase         Coverage                                             When
Launch        100% of Tier 0 results validated by AI tier          First month
Tuning        10% sampling, tighten thresholds where 95%+ agreement  Months 2-3
Steady state  1-5% statistical sampling                            Month 4+

This means Tier 0’s accuracy improves over time as the term lists and confidence thresholds are refined against AI ground truth.
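The phased sampling amounts to a probability check per Tier 0 result. A sketch, with rates mirroring the table above (the function and its shape are illustrative, not Sieve internals):

```python
import random

# Illustrative phased sampling: should this Tier 0 result also be sent
# to the AI tier for validation? Rates mirror the phases above.
SAMPLE_RATES = {"launch": 1.0, "tuning": 0.10, "steady": 0.05}

def should_validate(phase: str, rng: random.Random) -> bool:
    return rng.random() < SAMPLE_RATES[phase]
```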


  • Deep dive into general, gaming, and edge modes.

  • Full list of moderation categories with scoring details.

  • Complete API documentation with request/response schemas.

  • Add your own blocklists, allowlists, and regex patterns.