modguard implements a layered content moderation approach where each layer catches different types of violations. Rule-based filters handle obvious cases instantly, ML classifiers detect subtle toxicity, and sentiment analysis catches context-dependent negativity. An ensemble engine combines all signals into a single, explainable decision.
Rule-based filters: regex patterns, keyword blocklists, and spam detection. A fast, deterministic, zero-cost first pass that catches obvious violations with 100% confidence.
Toxicity classifier: a pre-trained transformer model for multi-label toxicity detection (toxic, severe_toxic, obscene, threat, insult, identity_hate). Returns per-label confidence scores.
Sentiment analysis: a sentiment model that catches subtle negativity, sarcasm, and backhanded compliments that the toxicity model might miss.
Ensemble engine: a weighted combination of all layer scores into a final decision (APPROVE, FLAG_FOR_REVIEW, or REJECT) with an overall confidence score and a human-readable explanation.
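As a rough sketch of how the rule-based first layer might behave (the function name, the regex, and the blocklist terms here are illustrative assumptions, not modguard's actual API), a deterministic check can return 1.0 on a known pattern and 0.0 otherwise:

```python
import re

# Illustrative rule-based first pass (names and patterns are assumptions):
# deterministic checks return 1.0 on a hit, 0.0 otherwise.
SPAM_URL = re.compile(r"https?://\S+\.(?:ru|xyz)\b", re.IGNORECASE)
BLOCKLIST = {"badword1", "badword2"}  # placeholder terms

def rule_score(text: str) -> float:
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    if tokens & BLOCKLIST or SPAM_URL.search(text):
        return 1.0  # known pattern: flag with full confidence
    return 0.0

print(rule_score("Check out https://cheap-pills.xyz now"))  # 1.0
print(rule_score("I really enjoyed the new update!"))       # 0.0
```

Because this layer is pure string matching, it runs in microseconds and never changes its mind, which is what lets the ensemble treat its hits as high-precision signals.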
No single classifier excels at every type of content violation. Rule-based filters are perfect for known patterns (slurs, spam URLs) but miss novel toxicity. ML models detect nuanced toxicity but are expensive and can produce false positives. Sentiment analysis catches negativity that is technically not toxic but still harmful. Layering these approaches lets each cover the others' blind spots, and the ensemble engine weighs their contributions based on reliability.
The ensemble uses weighted scoring: final_score = 0.4 * rule_score + 0.4 * toxicity_score + 0.2 * sentiment_score. Rules and toxicity get equal weight because rules are highly precise (few false positives) and toxicity models capture a broad range of violations. Sentiment gets lower weight because negative sentiment alone does not mean content should be moderated. These weights are configurable per deployment based on the specific community's tolerance levels.
For hate speech and threats, you want high recall (catch everything, even at the cost of some false positives). For mild negativity, you can tolerate lower recall because over-moderation drives users away. The threshold system (score < 0.3 = APPROVE, 0.3 to 0.7 = FLAG, >= 0.7 = REJECT) creates a middle ground: borderline content goes to human review rather than being auto-rejected. This balances user safety with free expression.
Every decision includes a confidence score, which is the weighted average of individual layer confidences. Low-confidence decisions (0.3-0.5) get routed to senior moderators. High-confidence rejections (>0.9) can be auto-actioned. This creates an efficient triage system where human moderators spend their time on genuinely ambiguous cases rather than reviewing obviously safe or obviously toxic content.
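The triage described above can be sketched in a few lines. This is a minimal illustration of the weighted confidence average and the routing rules, under the assumption that the ensemble weights mirror the scoring weights; the function and queue names are made up for the example, not part of modguard's API:

```python
# Illustrative confidence-based triage (names are assumptions).
def overall_confidence(layer_confidences, weights):
    # Weighted average of per-layer confidences.
    total = sum(weights.values())
    return sum(weights[k] * layer_confidences[k] for k in weights) / total

def route(decision: str, confidence: float) -> str:
    if 0.3 <= confidence <= 0.5:
        return "SENIOR_REVIEW"   # low confidence: genuinely ambiguous
    if decision == "REJECT" and confidence > 0.9:
        return "AUTO_ACTION"     # high-confidence rejection
    return "STANDARD_QUEUE"

weights = {"rules": 0.4, "toxicity": 0.4, "sentiment": 0.2}
conf = overall_confidence(
    {"rules": 1.0, "toxicity": 0.95, "sentiment": 0.9}, weights
)
print(round(conf, 2))         # 0.96
print(route("REJECT", conf))  # AUTO_ACTION
```

The effect is that human moderators only ever see the middle of the confidence distribution, which is where their judgment actually adds value.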
from modguard import ModerationPipeline

# Initialize with defaults
pipeline = ModerationPipeline()

# Moderate a single item
result = pipeline.moderate(
    "I really enjoyed the new update!"
)

print(result.decision)    # APPROVE
print(result.confidence)  # 0.95
print(result.explanation)

# Get layer-by-layer breakdown (loop variable renamed to avoid shadowing `result`)
for layer, layer_result in result.layer_results.items():
    print(f"{layer}: {layer_result}")

# Ensemble scoring formula
final_score = (
    0.4 * rule_score +
    0.4 * toxicity_score +
    0.2 * sentiment_score
)
# Decision thresholds
if final_score < 0.3:
    decision = "APPROVE"
elif final_score < 0.7:
    decision = "FLAG_FOR_REVIEW"
else:
    decision = "REJECT"
Clone the repo, run the demo data generator, and see the dashboard in action.