modguard implements a layered content moderation approach where each layer catches different types of violations. Rule-based filters handle obvious cases instantly, ML classifiers detect subtle toxicity, and sentiment analysis catches context-dependent negativity. An ensemble engine combines all signals into a single, explainable decision.
Rule-based filters: regex patterns, keyword blocklists, and spam detection. A fast, deterministic, zero-cost first pass that catches obvious violations with 100% confidence.
Toxicity classifier: a pre-trained transformer model for multi-label toxicity detection (toxic, severe_toxic, obscene, threat, insult, identity_hate). Returns per-label confidence scores.
Sentiment analysis: a sentiment model that catches subtle negativity, sarcasm, and backhanded compliments that the toxicity model might miss.
Ensemble engine: a weighted combination of all layer scores into a final decision (APPROVE, FLAG_FOR_REVIEW, or REJECT) with an overall confidence score and a human-readable explanation.
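As a rough sketch of how the rule-based first layer might behave (the function name, the regex, and the blocklist terms here are illustrative assumptions, not modguard's actual API), a deterministic check can return 1.0 on a known pattern and 0.0 otherwise:

```python
import re

# Illustrative rule-based first pass (names and patterns are assumptions):
# deterministic checks return 1.0 on a hit, 0.0 otherwise.
SPAM_URL = re.compile(r"https?://\S+\.(?:ru|xyz)\b", re.IGNORECASE)
BLOCKLIST = {"badword1", "badword2"}  # placeholder terms

def rule_score(text: str) -> float:
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    if tokens & BLOCKLIST or SPAM_URL.search(text):
        return 1.0  # known pattern: flag with full confidence
    return 0.0

print(rule_score("Check out https://cheap-pills.xyz now"))  # 1.0
print(rule_score("I really enjoyed the new update!"))       # 0.0
```

Because this layer is pure string matching, it runs in microseconds and never changes its mind, which is what lets the ensemble treat its hits as high-precision signals.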
No single classifier excels at every type of content violation. Rule-based filters are perfect for known patterns (slurs, spam URLs) but miss novel toxicity. ML models detect nuanced toxicity but are expensive and can produce false positives. Sentiment analysis catches negativity that is technically not toxic but still harmful. Layering these approaches lets each cover the others' blind spots, and the ensemble engine weighs their contributions based on reliability.
The ensemble uses weighted scoring: final_score = 0.4 * rule_score + 0.4 * toxicity_score + 0.2 * sentiment_score. Rules and toxicity get equal weight because rules are highly precise (few false positives) and toxicity models capture a broad range of violations. Sentiment gets lower weight because negative sentiment alone does not mean content should be moderated. These weights are configurable per deployment based on the specific community's tolerance levels.
For hate speech and threats, you want high recall (catch everything, even at the cost of some false positives). For mild negativity, you can tolerate lower recall because over-moderation drives users away. The threshold system (score < 0.3 = APPROVE, 0.3 to 0.7 = FLAG, >= 0.7 = REJECT) creates a middle ground: borderline content goes to human review rather than being auto-rejected. This balances user safety with free expression.
Every decision includes a confidence score, which is the weighted average of individual layer confidences. Low-confidence decisions (0.3-0.5) get routed to senior moderators. High-confidence rejections (>0.9) can be auto-actioned. This creates an efficient triage system where human moderators spend their time on genuinely ambiguous cases rather than reviewing obviously safe or obviously toxic content.
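The triage described above can be sketched in a few lines. This is a minimal illustration of the weighted confidence average and the routing rules, under the assumption that the ensemble weights mirror the scoring weights; the function and queue names are made up for the example, not part of modguard's API:

```python
# Illustrative confidence-based triage (names are assumptions).
def overall_confidence(layer_confidences, weights):
    # Weighted average of per-layer confidences.
    total = sum(weights.values())
    return sum(weights[k] * layer_confidences[k] for k in weights) / total

def route(decision: str, confidence: float) -> str:
    if 0.3 <= confidence <= 0.5:
        return "SENIOR_REVIEW"   # low confidence: genuinely ambiguous
    if decision == "REJECT" and confidence > 0.9:
        return "AUTO_ACTION"     # high-confidence rejection
    return "STANDARD_QUEUE"

weights = {"rules": 0.4, "toxicity": 0.4, "sentiment": 0.2}
conf = overall_confidence(
    {"rules": 1.0, "toxicity": 0.95, "sentiment": 0.9}, weights
)
print(round(conf, 2))         # 0.96
print(route("REJECT", conf))  # AUTO_ACTION
```

The effect is that human moderators only ever see the middle of the confidence distribution, which is where their judgment actually adds value.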
from modguard import ModerationPipeline

# Initialize with defaults
pipeline = ModerationPipeline()

# Moderate a single item
result = pipeline.moderate(
    "I really enjoyed the new update!"
)

print(result.decision)    # APPROVE
print(result.confidence)  # 0.95
print(result.explanation)

# Get layer-by-layer breakdown (loop variable renamed to avoid shadowing `result`)
for layer, layer_result in result.layer_results.items():
    print(f"{layer}: {layer_result}")

# Ensemble scoring formula
final_score = (
    0.4 * rule_score +
    0.4 * toxicity_score +
    0.2 * sentiment_score
)
# Decision thresholds
if final_score < 0.3:
    decision = "APPROVE"
elif final_score < 0.7:
    decision = "FLAG_FOR_REVIEW"
else:
    decision = "REJECT"
Clone the repo, run the demo data generator, and see the dashboard in action.