
How It Works

Every moderation request enters at Tier 0. If that tier can make a confident decision, it returns immediately. If not, the request escalates to the next tier. Most requests never leave Tier 0.

Request ──▶ Tier 0 (Rust/WASM) ──── 60-70% resolved ──▶ Response
                │ uncertain                        ▲
                ▼                                  │
            Tier 1 (AI Classification) ── 20-30% ──┤
                │ ambiguous                        │
                ▼                                  │
            Tier 2 (Claude Haiku) ─────── 5-10% ───┘

Local Rust/WASM — Sub-5ms, $0/request. Handles clear-cut cases with pattern matching and term lists.

AI Classification — 100-200ms. Standard AI moderation for content that needs semantic understanding.

Claude Haiku — 300-500ms. Advanced analysis for ambiguous content, custom categories, and edge cases.


Tier 0 is a Rust pipeline compiled to WASM, running on Cloudflare’s edge network. It handles the majority of requests with zero external API calls.

  1. Unicode normalization — Converts homoglyphs and confusable characters to their base forms. ℌ𝔢𝔩𝔩𝔬 becomes Hello. Catches attempts to bypass filters with special Unicode characters.

  2. Leetspeak decoding — Translates common substitution patterns. h4ck3r, ph1sh1ng, a$$ are decoded before matching.

  3. Aho-Corasick pattern matching — A multi-pattern string matching algorithm that checks content against categorized term lists in a single pass. Faster than checking thousands of individual regexes.

  4. Custom rule evaluation — Your account-specific blocklists, allowlists, and regex patterns are applied here.

  5. Confidence scoring — Each match produces a confidence score. If the aggregate confidence is high enough (above the escalation threshold), Tier 0 returns a result. If not, the request escalates.
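The five stages above can be sketched roughly as follows. This is an illustrative Python approximation, not Sieve's Rust implementation: the leet map, term list, and thresholds are made-up stand-ins, and the single-pass dictionary scan stands in for a compiled Aho-Corasick automaton.

```python
import unicodedata

# Illustrative data only -- not Sieve's actual substitution map or term lists.
LEET_MAP = str.maketrans("013457$@", "oieastsa")
TERMS = {                       # term -> (category, confidence)
    "phishing": ("spam", 0.9),
    "scam": ("spam", 0.8),
    "noob": ("toxicity", 0.4),  # weak signal: lands in the uncertain band
}

def normalize(text: str) -> str:
    # NFKC folds homoglyphs and confusables to base forms: "ℌ𝔢𝔩𝔩𝔬" -> "Hello"
    return unicodedata.normalize("NFKC", text).lower()

def decode_leet(text: str) -> str:
    # "h4ck3r" -> "hacker", "ph1sh1ng" -> "phishing"
    return text.translate(LEET_MAP)

def match_terms(text: str):
    # Stand-in for the Aho-Corasick pass: a real pipeline compiles
    # thousands of terms into one automaton and scans in a single pass.
    return [TERMS[t] for t in TERMS if t in text]

def tier0(text: str, safe_below: float = 0.2, toxic_above: float = 0.7):
    cleaned = decode_leet(normalize(text))
    score = max((conf for _, conf in match_terms(cleaned)), default=0.0)
    if score <= safe_below:
        return {"resolved": True, "action": "allow"}
    if score >= toxic_above:
        return {"resolved": True, "action": "flag", "score": score}
    return {"resolved": False}  # uncertain band: escalate to Tier 1
```

Clearly safe and clearly toxic content both resolve in Tier 0; only scores in the uncertain band between the two thresholds escalate.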

Tier 0 passes the request to Tier 1 when:

  • Content has no strong pattern matches but contains suspicious signals
  • Confidence score falls between the “clearly safe” and “clearly toxic” thresholds
  • Content uses heavy obfuscation that pattern matching can’t resolve
  • The request includes categories that require semantic understanding

Tier 1 uses AI models optimized for content classification. The specific model depends on your pipeline mode:

Mode     Model                                          Why
general  OpenAI omni-moderation                         Free, fast, broad category coverage
gaming   OpenAI omni-moderation + gaming-tuned prompts  Adjusted thresholds for competitive context
edge     skipped                                        Edge mode is Tier 0 only
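The mode-to-model mapping amounts to a small dispatch table. A sketch, where the config shape and the use of None as a "skip Tier 1" marker are assumptions of this example:

```python
# Illustrative dispatch table -- not Sieve's internal configuration format.
TIER1_CONFIG = {
    "general": {"model": "omni-moderation", "prompt": "default"},
    "gaming":  {"model": "omni-moderation", "prompt": "gaming-tuned"},
    "edge":    None,  # edge mode is Tier 0 only
}

def tier1_for_mode(mode: str):
    if mode not in TIER1_CONFIG:
        raise KeyError(f"unknown pipeline mode: {mode}")
    return TIER1_CONFIG[mode]  # None means the request never reaches Tier 1
```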

Tier 1 handles content that needs semantic understanding — sarcasm, implied threats, coded language, and context-dependent meaning.

  • Implied threats — “I know where your school is” (no explicit violent language)
  • Coded hate speech — Dog whistles and evolving slang that pattern matching hasn’t indexed
  • Sarcasm and intent — “Oh sure, everyone should just die” (venting vs. actual threat)
  • Context-dependent toxicity — Content that’s fine in one context but not another

Tier 2 activates for genuinely ambiguous content — the 5-10% of requests where standard AI classification isn’t confident enough. It uses Claude Haiku for deeper analysis, and is triggered when:

  • Tier 1 returns a confidence score in the “uncertain” range
  • Content involves custom categories you’ve defined (e.g., “game exploit discussion”, “spam recruitment”)
  • Content requires cross-field analysis (checking username + message content together)
  • Appeal reviews — when a user disputes a moderation decision

Tier 2 doesn’t just classify — it reasons. It considers:

  • The full context of the content (who said it, where, in response to what)
  • Your custom category definitions and examples
  • Cultural and community-specific norms you’ve configured
  • Whether the content is genuinely ambiguous or a false positive from earlier tiers
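The inputs Tier 2 reasons over can be pictured as a single assembled context. A hypothetical sketch of that assembly — the field names and prompt wording here are illustrative, not Sieve's internal format:

```python
# Hypothetical prompt assembly for Tier 2; not Sieve's actual template.
def build_tier2_prompt(content, metadata, custom_categories, norms):
    lines = [
        f"Message: {content}",
        f"Author: {metadata.get('username', 'unknown')}",
        f"In reply to: {metadata.get('reply_to', 'n/a')}",
    ]
    for name, definition in custom_categories.items():
        lines.append(f"Custom category '{name}': {definition}")
    if norms:
        lines.append(f"Community norms: {norms}")
    lines.append(
        "Decide whether this is a genuine violation, ambiguous, "
        "or a false positive from an earlier tier."
    )
    return "\n".join(lines)
```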

Every moderation result includes per-category scores between 0.0 (safe) and 1.0 (clear violation).

Category     What it covers
toxicity     General toxic language, insults, hostility
harassment   Targeted abuse, bullying, personal attacks
threat       Threats of violence, physical harm, doxxing
profanity    Explicit language, slurs, vulgar content
sexual       Sexual content, solicitation, explicit material
self_harm    Self-harm, suicide, eating disorders
hate_speech  Hate speech targeting protected characteristics
spam         Spam, scams, phishing, unsolicited promotion

Each category has a configurable threshold. When a score exceeds the threshold, that category is flagged. The overall action is determined by the highest-severity flag:

score < threshold → allow
score >= threshold (any) → flag (send to human review queue)
score >= block_threshold → block (auto-reject)

You configure both threshold (flag) and block_threshold (auto-reject) per category in your dashboard or via the config API.
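The flag/block decision above can be sketched as follows. The threshold values here are illustrative defaults, not Sieve's shipped configuration:

```python
# Example per-category thresholds -- illustrative values only.
THRESHOLDS = {
    "toxicity": {"threshold": 0.7, "block_threshold": 0.9},
    "threat":   {"threshold": 0.5, "block_threshold": 0.8},
}

def decide(scores: dict) -> str:
    action = "allow"
    for category, score in scores.items():
        limits = THRESHOLDS.get(category)
        if limits is None:
            continue                    # unconfigured category: ignore
        if score >= limits["block_threshold"]:
            return "block"              # highest severity wins outright
        if score >= limits["threshold"]:
            action = "flag"             # queue for human review
    return action
```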


The context field tells Sieve what kind of content it’s analyzing. This changes scoring behavior significantly.

Real-time gaming communication. Higher tolerance for competitive language, trash talk, and gaming slang. “Get rekt”, “you’re trash”, “ez clap” score low on toxicity. Actual threats and hate speech are still caught.

{ "context": "gaming_chat" }

User-chosen display names. Stricter scoring — a toxic username is a persistent statement, not a momentary outburst. Checks for leetspeak, Unicode tricks, and hidden slurs.

{ "context": "username" }

Long-form forum content. Considers the full body of text, weighs isolated phrases against overall tone, and handles quoted content appropriately.

{ "context": "forum_post" }

Short comments and replies. Default context. Balanced scoring without domain-specific adjustments.

{ "context": "comment" }

Shadow mode lets you run Sieve alongside your existing moderation system without affecting production. Every request is processed by both your current system and Sieve, and you can compare results in the dashboard.

  1. Send requests to Sieve with "shadow": true in the request body
  2. Sieve processes the content and logs the result, but always returns "action": "allow"
  3. Your existing moderation system makes the actual decision
  4. Compare agreement rates in the dashboard to build confidence before switching
{
  "content": "message to moderate",
  "context": "gaming_chat",
  "shadow": true
}
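Step 4's agreement comparison happens in the dashboard, but a minimal client-side approximation looks like this, assuming you log (your_action, sieve_logged_action) pairs yourself:

```python
# Minimal agreement-rate sketch over logged decision pairs.
def agreement_rate(pairs) -> float:
    if not pairs:
        return 0.0
    agreed = sum(1 for ours, sieve in pairs if ours == sieve)
    return agreed / len(pairs)
```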

Sieve uses shadow mode internally to continuously validate Tier 0 accuracy:

Phase         Coverage                                             When
Launch        100% of Tier 0 results validated by AI tier          First month
Tuning        10% sampling, tighten thresholds where 95%+ agreement  Months 2-3
Steady state  1-5% statistical sampling                            Month 4+

This means Tier 0’s accuracy improves over time as the term lists and confidence thresholds are refined against AI ground truth.
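The phased sampling amounts to a probability check per Tier 0 result. A sketch, with rates mirroring the table above (the function and its shape are illustrative, not Sieve internals):

```python
import random

# Illustrative phased sampling: should this Tier 0 result also be sent
# to the AI tier for validation? Rates mirror the phases above.
SAMPLE_RATES = {"launch": 1.0, "tuning": 0.10, "steady": 0.05}

def should_validate(phase: str, rng: random.Random) -> bool:
    return rng.random() < SAMPLE_RATES[phase]
```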


  • Deep dive into general, gaming, and edge modes.

  • Full list of moderation categories with scoring details.

  • Complete API documentation with request/response schemas.

  • Add your own blocklists, allowlists, and regex patterns.