Moderation Categories

Sieve evaluates content across seven moderation categories. Each category has a default threshold that determines when content is flagged or blocked. You can customize these thresholds per category to match your platform’s tolerance.

Toxicity

Default threshold: 0.7

Detects profanity, severe negativity, and hostile language that degrades conversation quality. This is the broadest category and catches general abusive language that doesn’t fit neatly into other categories.

Examples: Excessive profanity directed at others, aggressive hostility, deliberately inflammatory content.

Harassment

Default threshold: 0.7

Identifies personal attacks, bullying, and sustained targeting of individuals. Distinguished from general toxicity by the directed nature of the abuse.

Examples: Repeated personal insults, name-calling targeting a specific user, cyberbullying, dogpiling.

Hate Speech

Default threshold: 0.7

Detects slurs, discrimination, and dehumanizing language targeting protected characteristics (race, ethnicity, gender, sexuality, religion, disability).

Examples: Racial slurs, dehumanizing comparisons, calls for discrimination, Holocaust denial.

Sexual

Default threshold: 0.7

Identifies sexual content, explicit material, and grooming patterns. Includes escalated detection for content that may indicate CSAM (child sexual abuse material).

Examples: Explicit sexual descriptions, unsolicited sexual content, grooming language patterns.

Violence

Default threshold: 0.7

Detects threats of physical harm, doxxing (sharing personal information), and swatting threats. Violence has a lower effective block gap than other categories — it takes less confidence to move from “flag” to “block.”

Examples: Death threats, threats to find someone’s address, swatting threats, detailed descriptions of intended violence.

Self-Harm

Default threshold: 0.5 (lower = stricter)

Identifies content encouraging self-harm or suicide. The default threshold is deliberately lower than other categories, making this the strictest category out of the box.

Examples: “KYS” and variants, suicide encouragement, detailed self-harm instructions, glorification of self-harm.

Spam

Default threshold: 0.8 (higher = more lenient)

Detects real-money trading (RMT), scam links, account selling, boosting services, and flood/repetition spam. The higher default threshold makes this the most lenient category, reducing false positives on legitimate commercial discussion.

Examples: “Selling 10k gold $5 PayPal,” phishing links, repeated identical messages, account trading offers.

Every piece of content receives a score from 0.0 to 1.0 for each category. The score is compared against the threshold to determine the action:

| Score vs Threshold | Action | Description |
| --- | --- | --- |
| score < threshold | allow | Content passes moderation |
| threshold <= score < threshold + block gap | flag | Content is flagged for review |
| score >= threshold + block gap | block | Content should be blocked |

The block gap is the additional confidence needed above the flag threshold to recommend blocking. Block gaps vary by category:

| Category | Default Threshold | Block Gap | Block At |
| --- | --- | --- | --- |
| Toxicity | 0.70 | 0.15 | 0.85 |
| Harassment | 0.70 | 0.15 | 0.85 |
| Hate Speech | 0.70 | 0.10 | 0.80 |
| Sexual | 0.70 | 0.10 | 0.80 |
| Violence | 0.70 | 0.08 | 0.78 |
| Self-Harm | 0.50 | 0.10 | 0.60 |
| Spam | 0.80 | 0.10 | 0.90 |
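The flag/block decision above can be sketched as a small function. This is illustrative only — the threshold and block-gap values mirror the table, but `decide` and the `DEFAULTS` mapping are not part of the Sieve API:

```python
# Default flag thresholds and block gaps per category (values from the table above).
DEFAULTS = {
    "toxicity":    (0.70, 0.15),
    "harassment":  (0.70, 0.15),
    "hate_speech": (0.70, 0.10),
    "sexual":      (0.70, 0.10),
    "violence":    (0.70, 0.08),
    "self_harm":   (0.50, 0.10),
    "spam":        (0.80, 0.10),
}

def decide(category: str, score: float) -> str:
    """Map a 0.0-1.0 category score to allow / flag / block."""
    threshold, block_gap = DEFAULTS[category]
    if score >= threshold + block_gap:
        return "block"
    if score >= threshold:
        return "flag"
    return "allow"

print(decide("toxicity", 0.60))  # below 0.70 -> allow
print(decide("toxicity", 0.75))  # between 0.70 and 0.85 -> flag
print(decide("violence", 0.80))  # above 0.70 + 0.08 -> block
```

Note how Violence's smaller block gap (0.08) means a score of 0.80 already blocks there, while the same score in Toxicity would only flag.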

Thresholds are adjusted by a context multiplier based on the content type. The effective threshold is:

effective_threshold = base_threshold * context_multiplier

For example, a username context (0.8x multiplier) with the default toxicity threshold (0.7) yields an effective threshold of 0.56 — meaning usernames are held to a stricter standard.

See Contexts for the full list of context multipliers.
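The multiplication is straightforward; a minimal sketch of the worked example above (the 0.8x username multiplier comes from that example — other multiplier values are on the Contexts page):

```python
def effective_threshold(base_threshold: float, context_multiplier: float) -> float:
    """Scale a category's base threshold by the content-type multiplier."""
    return base_threshold * context_multiplier

# Username context (0.8x) with the default toxicity threshold (0.7):
print(round(effective_threshold(0.7, 0.8), 2))  # 0.56 -- stricter than the base 0.7
```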

You can override default thresholds per category when making API calls:

```sh
curl -X POST https://api.getsieve.dev/v1/moderate/text \
  -H "Authorization: Bearer mod_live_your_key" \
  -H "Content-Type: application/json" \
  -d '{
    "text": "content to moderate",
    "context": "chat",
    "thresholds": {
      "toxicity": 0.5,
      "spam": 0.6
    }
  }'
```
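The same request body can be built in Python. This is a sketch: the endpoint, header, and fields come from the curl example above, and actually sending the request would need an HTTP client such as `requests` (shown only as a comment here):

```python
import json

# Endpoint and credentials as in the curl example above.
API_URL = "https://api.getsieve.dev/v1/moderate/text"
HEADERS = {
    "Authorization": "Bearer mod_live_your_key",
    "Content-Type": "application/json",
}

payload = {
    "text": "content to moderate",
    "context": "chat",
    # Per-request overrides of the default category thresholds.
    "thresholds": {"toxicity": 0.5, "spam": 0.6},
}

body = json.dumps(payload)
print(body)
# To send: requests.post(API_URL, headers=HEADERS, data=body)
```

Categories omitted from `thresholds` keep their defaults, so you only need to list the ones you want to override.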