Moderation Categories
Sieve evaluates content across seven moderation categories. Each category has a default threshold that determines when content is flagged or blocked. You can customize these thresholds per category to match your platform’s tolerance.
Categories
Toxicity
Default threshold: 0.7
Detects profanity, severe negativity, and hostile language that degrades conversation quality. This is the broadest category and catches general abusive language that doesn’t fit neatly into other categories.
Examples: Excessive profanity directed at others, aggressive hostility, deliberately inflammatory content.
Harassment
Default threshold: 0.7
Identifies personal attacks, bullying, and sustained targeting of individuals. Distinguished from general toxicity by the directed nature of the abuse.
Examples: Repeated personal insults, name-calling targeting a specific user, cyberbullying, dogpiling.
Hate Speech
Default threshold: 0.7
Detects slurs, discrimination, and dehumanizing language targeting protected characteristics (race, ethnicity, gender, sexuality, religion, disability).
Examples: Racial slurs, dehumanizing comparisons, calls for discrimination, Holocaust denial.
Sexual
Default threshold: 0.7
Identifies sexual content, explicit material, and grooming patterns. Includes escalated detection for content that may indicate CSAM (child sexual abuse material).
Examples: Explicit sexual descriptions, unsolicited sexual content, grooming language patterns.
Violence
Default threshold: 0.7
Detects threats of physical harm, doxxing (sharing personal information), and swatting threats. Violence has a lower effective block gap than other categories — it takes less confidence to move from “flag” to “block.”
Examples: Death threats, threats to find someone’s address, swatting threats, detailed descriptions of intended violence.
Self-Harm
Default threshold: 0.5 (lower = stricter)
Identifies content encouraging self-harm or suicide. The default threshold is deliberately lower than other categories, making this the strictest category out of the box.
Examples: “KYS” and variants, suicide encouragement, detailed self-harm instructions, glorification of self-harm.
Spam
Default threshold: 0.8 (higher = more lenient)
Detects real-money trading (RMT), scam links, account selling, boosting services, and flood/repetition spam. The higher default threshold makes this the most lenient category, reducing false positives on legitimate commercial discussion.
Examples: “Selling 10k gold $5 PayPal,” phishing links, repeated identical messages, account trading offers.
Threshold Behavior
Every piece of content receives a score from 0.0 to 1.0 for each category. The score is compared against the threshold to determine the action:
| Score vs Threshold | Action | Description |
|---|---|---|
| score < threshold | allow | Content passes moderation |
| threshold <= score < threshold + block gap | flag | Content is flagged for review |
| score >= threshold + block gap | block | Content should be blocked |
The block gap is the additional confidence needed above the flag threshold to recommend blocking. Block gaps vary by category:
| Category | Default Threshold | Block Gap | Block At |
|---|---|---|---|
| Toxicity | 0.7 | 0.15 | 0.85 |
| Harassment | 0.7 | 0.15 | 0.85 |
| Hate Speech | 0.7 | 0.10 | 0.80 |
| Sexual | 0.7 | 0.10 | 0.80 |
| Violence | 0.7 | 0.08 | 0.78 |
| Self-Harm | 0.5 | 0.10 | 0.60 |
| Spam | 0.8 | 0.10 | 0.90 |
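The decision rule above can be sketched in a few lines. This is an illustrative reimplementation of the table, not Sieve's internal code; the function name and category keys are hypothetical:

```python
# Default (threshold, block_gap) per category, taken from the table above.
# Category key spellings are illustrative, not the API's canonical names.
DEFAULTS = {
    "toxicity":    (0.7, 0.15),
    "harassment":  (0.7, 0.15),
    "hate_speech": (0.7, 0.10),
    "sexual":      (0.7, 0.10),
    "violence":    (0.7, 0.08),
    "self_harm":   (0.5, 0.10),
    "spam":        (0.8, 0.10),
}

def action_for(category: str, score: float) -> str:
    """Map a 0.0-1.0 score to allow/flag/block using the default thresholds."""
    threshold, block_gap = DEFAULTS[category]
    if score >= threshold + block_gap:
        return "block"
    if score >= threshold:
        return "flag"
    return "allow"
```

Note how the smaller violence gap plays out: a violence score of 0.80 already blocks (0.80 >= 0.78), while the same score on toxicity only flags.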
Context Multipliers
Thresholds are adjusted by a context multiplier based on the content type. The effective threshold is:
```
effective_threshold = base_threshold * context_multiplier
```

For example, a username context (0.8x multiplier) with the default toxicity threshold (0.7) yields an effective threshold of 0.56, meaning usernames are held to a stricter standard.
See Contexts for the full list of context multipliers.
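As a quick sketch, the adjustment is a single multiplication; the 0.8x username multiplier is the one from the example above, and the rest are listed in the Contexts reference:

```python
def effective_threshold(base_threshold: float, context_multiplier: float) -> float:
    """Scale a category's base threshold by the content-type multiplier."""
    return base_threshold * context_multiplier

# Default toxicity threshold (0.7) in a username context (0.8x multiplier):
username_toxicity = effective_threshold(0.7, 0.8)  # 0.56, stricter than chat
```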
Customizing Thresholds
Section titled “Customizing Thresholds”You can override default thresholds per category when making API calls:
curl -X POST https://api.getsieve.dev/v1/moderate/text \ -H "Authorization: Bearer mod_live_your_key" \ -H "Content-Type: application/json" \ -d '{ "text": "content to moderate", "context": "chat", "thresholds": { "toxicity": 0.5, "spam": 0.6 } }'