Moderation Categories

Sieve evaluates content across seven moderation categories. Each category has a default threshold that determines when content is flagged or blocked. You can customize these thresholds per category to match your platform’s tolerance.

Toxicity

Default threshold: 0.7

Detects profanity, severe negativity, and hostile language that degrades conversation quality. This is the broadest category and catches general abusive language that doesn’t fit neatly into other categories.

Examples: Excessive profanity directed at others, aggressive hostility, deliberately inflammatory content.

Harassment

Default threshold: 0.7

Identifies personal attacks, bullying, and sustained targeting of individuals. Distinguished from general toxicity by the directed nature of the abuse.

Examples: Repeated personal insults, name-calling targeting a specific user, cyberbullying, dogpiling.

Hate Speech

Default threshold: 0.7

Detects slurs, discrimination, and dehumanizing language targeting protected characteristics (race, ethnicity, gender, sexuality, religion, disability).

Examples: Racial slurs, dehumanizing comparisons, calls for discrimination, Holocaust denial.

Sexual

Default threshold: 0.7

Identifies sexual content, explicit material, and grooming patterns. Includes escalated detection for content that may indicate CSAM (child sexual abuse material).

Examples: Explicit sexual descriptions, unsolicited sexual content, grooming language patterns.

Violence

Default threshold: 0.7

Detects threats of physical harm, doxxing (sharing personal information), and swatting threats. Violence has a lower effective block gap than other categories — it takes less confidence to move from “flag” to “block.”

Examples: Death threats, threats to find someone’s address, swatting threats, detailed descriptions of intended violence.

Self-Harm

Default threshold: 0.5 (lower = stricter)

Identifies content encouraging self-harm or suicide. The default threshold is deliberately lower than other categories, making this the strictest category out of the box.

Examples: “KYS” and variants, suicide encouragement, detailed self-harm instructions, glorification of self-harm.

Spam

Default threshold: 0.8 (higher = more lenient)

Detects real-money trading (RMT), scam links, account selling, boosting services, and flood/repetition spam. The higher default threshold makes this the most lenient category, reducing false positives on legitimate commercial discussion.

Examples: “Selling 10k gold $5 PayPal,” phishing links, repeated identical messages, account trading offers.

Every piece of content receives a score from 0.0 to 1.0 for each category. The score is compared against the threshold to determine the action:

| Score vs Threshold | Action | Description |
| --- | --- | --- |
| score < threshold | allow | Content passes moderation |
| threshold <= score < threshold + block gap | flag | Content is flagged for review |
| score >= threshold + block gap | block | Content should be blocked |

The block gap is the additional confidence needed above the flag threshold to recommend blocking. Block gaps vary by category:

| Category | Default Threshold | Block Gap | Block At |
| --- | --- | --- | --- |
| Toxicity | 0.70 | 0.15 | 0.85 |
| Harassment | 0.70 | 0.15 | 0.85 |
| Hate Speech | 0.70 | 0.10 | 0.80 |
| Sexual | 0.70 | 0.10 | 0.80 |
| Violence | 0.70 | 0.08 | 0.78 |
| Self-Harm | 0.50 | 0.10 | 0.60 |
| Spam | 0.80 | 0.10 | 0.90 |
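The flag/block decision above can be sketched as a small function. This is illustrative only — the threshold and block-gap values mirror the table, but `decide` and the `DEFAULTS` mapping are not part of the Sieve API:

```python
# Default flag thresholds and block gaps per category (values from the table above).
DEFAULTS = {
    "toxicity":    (0.70, 0.15),
    "harassment":  (0.70, 0.15),
    "hate_speech": (0.70, 0.10),
    "sexual":      (0.70, 0.10),
    "violence":    (0.70, 0.08),
    "self_harm":   (0.50, 0.10),
    "spam":        (0.80, 0.10),
}

def decide(category: str, score: float) -> str:
    """Map a 0.0-1.0 category score to allow / flag / block."""
    threshold, block_gap = DEFAULTS[category]
    if score >= threshold + block_gap:
        return "block"
    if score >= threshold:
        return "flag"
    return "allow"

print(decide("toxicity", 0.60))  # below 0.70 -> allow
print(decide("toxicity", 0.75))  # between 0.70 and 0.85 -> flag
print(decide("violence", 0.80))  # above 0.70 + 0.08 -> block
```

Note how Violence's smaller block gap (0.08) means a score of 0.80 already blocks there, while the same score in Toxicity would only flag.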

Thresholds are adjusted by a context multiplier based on the content type. The effective threshold is:

effective_threshold = base_threshold * context_multiplier

For example, a username context (0.8x multiplier) with the default toxicity threshold (0.7) yields an effective threshold of 0.56 — meaning usernames are held to a stricter standard.

See Contexts for the full list of context multipliers.
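The multiplication is straightforward; a minimal sketch of the worked example above (the 0.8x username multiplier comes from that example — other multiplier values are on the Contexts page):

```python
def effective_threshold(base_threshold: float, context_multiplier: float) -> float:
    """Scale a category's base threshold by the content-type multiplier."""
    return base_threshold * context_multiplier

# Username context (0.8x) with the default toxicity threshold (0.7):
print(round(effective_threshold(0.7, 0.8), 2))  # 0.56 -- stricter than the base 0.7
```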

You can override default thresholds per category when making API calls:

```sh
curl -X POST https://api.getsieve.dev/v1/moderate/text \
  -H "Authorization: Bearer mod_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "content to moderate",
    "context": "chat",
    "thresholds": {
      "toxicity": 0.5,
      "spam": 0.6
    }
  }'
```
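The same request body can be built in Python. This is a sketch: the endpoint, header, and fields come from the curl example above, and actually sending the request would need an HTTP client such as `requests` (shown only as a comment here):

```python
import json

# Endpoint and credentials as in the curl example above.
API_URL = "https://api.getsieve.dev/v1/moderate/text"
HEADERS = {
    "Authorization": "Bearer mod_live_your_key",
    "Content-Type": "application/json",
}

payload = {
    "text": "content to moderate",
    "context": "chat",
    # Per-request overrides of the default category thresholds.
    "thresholds": {"toxicity": 0.5, "spam": 0.6},
}

body = json.dumps(payload)
print(body)
# To send: requests.post(API_URL, headers=HEADERS, data=body)
```

Categories omitted from `thresholds` keep their defaults, so you only need to list the ones you want to override.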