How It Works
The Pipeline
Every moderation request enters at Tier 0. If that tier can make a confident decision, it returns immediately. If not, the request escalates to the next tier. Most requests never leave Tier 0.
```
Request ── Tier 0 (Rust/WASM) ──── 60-70% resolved ──── Response
              │ uncertain   ▲
              ▼             │
           Tier 1 (AI Classification) ── 20-30% ────────┘
              │ ambiguous   ▲
              ▼             │
           Tier 2 (Claude Haiku) ─────── 5-10% ─────────┘
```

- Local Rust/WASM — Sub-5ms, $0/request. Handles clear-cut cases with pattern matching and term lists.
- AI Classification — 100-200ms. Standard AI moderation for content that needs semantic understanding.
- Claude Haiku — 300-500ms. Advanced analysis for ambiguous content, custom categories, and edge cases.
Tier 0: Local Processing
Tier 0 is a Rust pipeline compiled to WASM, running on Cloudflare’s edge network. It handles the majority of requests with zero external API calls.
What it does
- Unicode normalization — Converts homoglyphs and confusable characters to their base forms. ℌ𝔢𝔩𝔩𝔬 becomes Hello. Catches attempts to bypass filters with special Unicode characters.
- Leetspeak decoding — Translates common substitution patterns. h4ck3r, ph1sh1ng, and a$$ are decoded before matching.
- Aho-Corasick pattern matching — A multi-pattern string matching algorithm that checks content against categorized term lists in a single pass. Faster than checking thousands of individual regexes.
- Custom rule evaluation — Your account-specific blocklists, allowlists, and regex patterns are applied here.
- Confidence scoring — Each match produces a confidence score. If the aggregate confidence is high enough (above the escalation threshold), Tier 0 returns a result. If not, the request escalates.
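The stages above can be sketched end to end. This is a minimal illustration, not Sieve's implementation: the real pipeline is Rust/WASM on Cloudflare's edge, and the term list, weights, and threshold below are made-up stand-ins.

```python
import unicodedata

# Leetspeak substitution table (illustrative subset).
LEET_MAP = str.maketrans({"4": "a", "3": "e", "1": "i", "0": "o", "$": "s"})

# Toy categorized term list with per-term confidence weights, standing in
# for the Aho-Corasick automaton over full term lists.
TERMS = {"hack": ("toxicity", 0.9), "sus": ("toxicity", 0.4)}

ESCALATION_THRESHOLD = 0.8  # illustrative value

def normalize(text: str) -> str:
    """Fold homoglyphs to base forms (NFKC), lowercase, decode leetspeak."""
    return unicodedata.normalize("NFKC", text).lower().translate(LEET_MAP)

def tier0(text: str):
    """Return ('flag' | 'allow' | 'escalate', matched terms)."""
    norm = normalize(text)
    matches = [(t, cat) for t, (cat, _) in TERMS.items() if t in norm]
    confidence = max((w for t, (_, w) in TERMS.items() if t in norm), default=0.0)
    if confidence >= ESCALATION_THRESHOLD:
        return "flag", matches        # confident: resolve locally
    if not matches:
        return "allow", matches       # nothing suspicious at all
    return "escalate", matches        # uncertain: hand off to Tier 1
```

Note how NFKC normalization alone already folds the fraktur ℌ𝔢𝔩𝔩𝔬 example back to plain letters before any matching runs.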
When it escalates
Tier 0 passes the request to Tier 1 when:
- Content has no strong pattern matches but contains suspicious signals
- Confidence score falls between the “clearly safe” and “clearly toxic” thresholds
- Content uses heavy obfuscation that pattern matching can’t resolve
- The request includes categories that require semantic understanding
Tier 1: AI Classification
Tier 1 uses AI models optimized for content classification. The specific model depends on your pipeline mode:
| Mode | Model | Why |
|---|---|---|
| `general` | OpenAI omni-moderation | Free, fast, broad category coverage |
| `gaming` | OpenAI omni-moderation + gaming-tuned prompts | Adjusted thresholds for competitive context |
| `edge` | skipped | Edge mode is Tier 0 only |
Tier 1 handles content that needs semantic understanding — sarcasm, implied threats, coded language, and context-dependent meaning.
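The mode table can be read as simple routing logic. A minimal sketch, assuming a config shape of our own invention; only the mode/model pairing comes from the table above:

```python
# Tier 1 routing per pipeline mode. Field names are hypothetical.
TIER1_CONFIG = {
    "general": {"model": "OpenAI omni-moderation", "gaming_tuned": False},
    "gaming":  {"model": "OpenAI omni-moderation", "gaming_tuned": True},
    "edge":    None,  # edge mode is Tier 0 only; Tier 1 is skipped
}

def tier1_route(mode: str):
    """Return the Tier 1 config for a pipeline mode, or None if skipped."""
    if mode not in TIER1_CONFIG:
        raise ValueError(f"unknown pipeline mode: {mode!r}")
    return TIER1_CONFIG[mode]
```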
What it catches that Tier 0 misses
- Implied threats — “I know where your school is” (no explicit violent language)
- Coded hate speech — Dog whistles and evolving slang that pattern matching hasn’t indexed
- Sarcasm and intent — “Oh sure, everyone should just die” (venting vs. actual threat)
- Context-dependent toxicity — Content that’s fine in one context but not another
Tier 2: Advanced Analysis
Tier 2 activates for genuinely ambiguous content — the 5-10% of requests where standard AI classification isn’t confident enough. It uses Claude Haiku for deeper analysis.
When Tier 2 activates
- Tier 1 returns a confidence score in the “uncertain” range
- Content involves custom categories you’ve defined (e.g., “game exploit discussion”, “spam recruitment”)
- Content requires cross-field analysis (checking username + message content together)
- Appeal reviews — when a user disputes a moderation decision
What makes it different
Tier 2 doesn’t just classify — it reasons. It considers:
- The full context of the content (who said it, where, in response to what)
- Your custom category definitions and examples
- Cultural and community-specific norms you’ve configured
- Whether the content is genuinely ambiguous or a false positive from earlier tiers
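One way to picture these inputs is as a single analysis request. Every field name below is hypothetical, not Sieve's schema; only the kinds of context (author/channel/reply, custom categories, norms, earlier-tier scores) come from the list above:

```python
def build_tier2_request(content, author, channel, reply_to,
                        custom_categories, community_norms, earlier_scores):
    """Assemble the context Tier 2 reasons over (illustrative shape)."""
    return {
        "content": content,
        "conversation": {
            "author": author,        # who said it
            "channel": channel,      # where
            "reply_to": reply_to,    # in response to what
        },
        "custom_categories": custom_categories,  # your definitions + examples
        "community_norms": community_norms,      # configured norms
        "earlier_tier_scores": earlier_scores,   # to spot false positives
    }
```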
How Scoring Works
Every moderation result includes per-category scores between 0.0 (safe) and 1.0 (clear violation).
Categories
Section titled “Categories”| Category | What it covers |
|---|---|
toxicity | General toxic language, insults, hostility |
harassment | Targeted abuse, bullying, personal attacks |
threat | Threats of violence, physical harm, doxxing |
profanity | Explicit language, slurs, vulgar content |
sexual | Sexual content, solicitation, explicit material |
self_harm | Self-harm, suicide, eating disorders |
hate_speech | Hate speech targeting protected characteristics |
spam | Spam, scams, phishing, unsolicited promotion |
Thresholds and Actions
Each category has a configurable threshold. When a score exceeds the threshold, that category is flagged. The overall action is determined by the highest-severity flag:
```
score <  threshold        → allow
score >= threshold (any)  → flag  (send to human review queue)
score >= block_threshold  → block (auto-reject)
```

You configure both `threshold` (flag) and `block_threshold` (auto-reject) per category in your dashboard or via the config API.
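The rules above reduce to a small decision function. A sketch under assumed data shapes (plain dicts of category → score); the allow/flag/block ordering comes from the docs:

```python
SEVERITY = {"allow": 0, "flag": 1, "block": 2}

def decide(scores, thresholds, block_thresholds):
    """Return (overall_action, flagged_categories) from per-category scores."""
    action, flagged = "allow", []
    for category, score in scores.items():
        if score >= block_thresholds[category]:
            cat_action = "block"     # auto-reject
        elif score >= thresholds[category]:
            cat_action = "flag"      # human review queue
        else:
            continue                 # below threshold: not flagged
        flagged.append(category)
        if SEVERITY[cat_action] > SEVERITY[action]:
            action = cat_action      # highest-severity flag wins overall
    return action, flagged
```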
Content Context
The `context` field tells Sieve what kind of content it’s analyzing. This changes scoring behavior significantly.
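In practice you pick the context per surface in your app. A hypothetical helper; the four context ids are from this page, but the surface names and fallback choice are assumptions:

```python
# Map where content appears in your app to Sieve's context value.
CONTEXT_BY_SURFACE = {
    "match_chat":   "gaming_chat",
    "display_name": "username",
    "thread_body":  "forum_post",
    "thread_reply": "comment",
}

def moderation_request(surface: str, content: str) -> dict:
    # Unknown surfaces fall back to the default "comment" context.
    return {"content": content,
            "context": CONTEXT_BY_SURFACE.get(surface, "comment")}
```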
Real-time gaming communication. Higher tolerance for competitive language, trash talk, and gaming slang. “Get rekt”, “you’re trash”, “ez clap” score low on toxicity. Actual threats and hate speech are still caught.

```json
{ "context": "gaming_chat" }
```

User-chosen display names. Stricter scoring — a toxic username is a persistent statement, not a momentary outburst. Checks for leetspeak, Unicode tricks, and hidden slurs.

```json
{ "context": "username" }
```

Long-form forum content. Considers the full body of text, weighs isolated phrases against overall tone, and handles quoted content appropriately.

```json
{ "context": "forum_post" }
```

Short comments and replies. Default context. Balanced scoring without domain-specific adjustments.

```json
{ "context": "comment" }
```

Shadow Mode
Section titled “Shadow Mode”Shadow mode lets you run Sieve alongside your existing moderation system without affecting production. Every request is processed by both your current system and Sieve, and you can compare results in the dashboard.
How it works
Section titled “How it works”- Send requests to Sieve with
"shadow": truein the request body - Sieve processes the content and logs the result, but always returns
"action": "allow" - Your existing moderation system makes the actual decision
- Compare agreement rates in the dashboard to build confidence before switching
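Those steps can be wired up as a thin wrapper around both systems. A sketch in which both callables are hypothetical stand-ins for your existing moderator and a Sieve API client:

```python
def moderate_with_shadow(content, context, legacy_moderate, sieve_moderate, log):
    """Run Sieve in shadow while the legacy system decides."""
    # Sieve scores and logs the content, but with shadow=true it always
    # answers "allow", so it cannot affect production.
    sieve_result = sieve_moderate(
        {"content": content, "context": context, "shadow": True})
    # The real decision still comes from your existing system.
    decision = legacy_moderate(content)
    # Record both sides so agreement rates can be compared later.
    log({"legacy": decision, "sieve": sieve_result})
    return decision
```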
```json
{
  "content": "message to moderate",
  "context": "gaming_chat",
  "shadow": true
}
```

Validation phases
Sieve uses shadow mode internally to continuously validate Tier 0 accuracy:
| Phase | Coverage | When |
|---|---|---|
| Launch | 100% of Tier 0 results validated by AI tier | First month |
| Tuning | 10% sampling, tighten thresholds where 95%+ agreement | Months 2-3 |
| Steady state | 1-5% statistical sampling | Month 4+ |
This means Tier 0’s accuracy improves over time as the term lists and confidence thresholds are refined against AI ground truth.
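The tuning phase boils down to an agreement check over sampled decisions. An illustrative sketch; the 95% target comes from the table above, while the function names and data shape are hypothetical:

```python
def agreement_rate(pairs):
    """pairs: list of (tier0_action, ai_action) for sampled requests."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(1 for t0, ai in pairs if t0 == ai) / len(pairs)

def can_tighten_thresholds(pairs, target=0.95):
    """Tighten Tier 0 thresholds only where agreement meets the target."""
    return agreement_rate(pairs) >= target
```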
Next Steps
- Deep dive into general, gaming, and edge modes.
- Full list of moderation categories with scoring details.
- Complete API documentation with request/response schemas.
- Add your own blocklists, allowlists, and regex patterns.