Spam emails are not just annoying—they’re a security risk and a drain on server resources. While basic spam filters catch obvious junk, Bayesian filtering provides intelligent, self-learning protection that improves over time. This guide will walk you through implementing Bayesian filtering with SpamAssassin on both Red Hat (RHEL/CentOS/AlmaLinux/Rocky) and Ubuntu servers.

What is Bayesian Filtering?

Bayesian filtering uses probability theory to determine if an email is spam. It analyzes the words and patterns in emails, learning from what you mark as spam or ham (legitimate mail). The more you train it, the smarter it gets.
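
To make the idea concrete, here is a minimal sketch of how per-word spam probabilities can be combined into a single verdict with the naive Bayes rule. The three probabilities are made-up values and combine_probs is a hypothetical helper, not part of SpamAssassin. SpamAssassin's own Bayes engine uses a more robust chi-squared combining method, so this only illustrates the principle.

```shell
#!/bin/bash
# combine_probs: read one per-token spam probability per line and print
# P = prod(p) / (prod(p) + prod(1 - p)), the naive Bayes combination.
combine_probs() {
    awk 'BEGIN { s = 1; h = 1 }
         NF    { s *= $1; h *= (1 - $1) }
         END   { printf "%.3f\n", s / (s + h) }'
}

# Illustrative values: two "spammy" words and one "hammy" word.
printf '%s\n' 0.95 0.80 0.10 | combine_probs   # prints 0.894
```

Two strongly spammy words outweigh one hammy word here, which is why training on both spam and ham matters: each class supplies the evidence that pulls the result in its direction.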

Prerequisites

  • Root or sudo access to your server
  • Postfix or other MTA already configured
  • Basic understanding of email flow

Part 1: Installing SpamAssassin

On Ubuntu/Debian

# Update package list
apt update

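# Install SpamAssassin and the spamc client
# (sa-learn ships inside the spamassassin package; it is not a separate package)
apt install spamassassin spamc -y

# Enable the service to start on boot
# (on newer Debian/Ubuntu releases the daemon may be packaged separately as "spamd",
# with a unit of the same name)
systemctl enable spamassassin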

On Red Hat/CentOS/AlmaLinux/Rocky

# Install EPEL repository (if not already enabled)
dnf install epel-release -y

# Install SpamAssassin (the spamc client is included in this package)
dnf install spamassassin -y

# Enable the service
systemctl enable spamassassin

Part 2: Initial Configuration

2.1 Configure SpamAssassin


Edit the main configuration file:

nano /etc/mail/spamassassin/local.cf

Add these basic settings:

# Required score to mark as spam (lower = more aggressive)
required_score 5.0

# Rewrite subject line for spam
rewrite_header subject *****SPAM*****

# Use Bayesian filtering
use_bayes 1
bayes_auto_learn 1
# Auto-learn thresholds (SpamAssassin defaults: 12.0 for spam, 0.1 for non-spam)
bayes_auto_learn_threshold_spam 7.0
bayes_auto_learn_threshold_nonspam 0.5

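# Enable network tests
skip_rbl_checks 0
# Each of the next three lines only parses if the matching plugin (Razor2, DCC, Pyzor)
# is loaded in a .pre file; remove any that "spamassassin --lint" rejects
use_razor2 1
use_dcc 1
use_pyzor 1

# DNS blocklists (checked whenever skip_rbl_checks is 0 and DNS is available)
dns_available yes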

# Whitelist and blacklist
# whitelist_from *@yourdomain.com
# blacklist_from *@known-spam-domain.com

# Additional rules
# ok_languages needs the TextCat plugin; drop the line if "spamassassin --lint" rejects it
ok_languages all
ok_locales all

2.2 Configure SpamAssassin to Work with Postfix

Edit Postfix master configuration:

nano /etc/postfix/master.cf

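Add or uncomment these lines. The user= value must be an existing account; Debian and Ubuntu packages create debian-spamd rather than spamd: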

smtp      inet  n       -       y       -       -       smtpd
  -o content_filter=spamassassin

spamassassin unix -     n       n       -       -       pipe
  flags=Rq user=spamd argv=/usr/bin/spamc -f -e /usr/sbin/sendmail -oi -f ${sender} ${recipient}
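
Postfix refuses to run the pipe service if the account named in user= does not exist, so it is worth checking before restarting. The sketch below uses a hypothetical helper, user_exists, and tries two account names that are common for SpamAssassin packages. The names are assumptions, so adjust the list to your system.

```shell
#!/bin/bash
# user_exists: succeed if the named account exists on this system.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# Report which of the usual candidate accounts is present.
for candidate in spamd debian-spamd; do
    if user_exists "$candidate"; then
        echo "found account: $candidate"
    fi
done
```

If nothing is printed, create a dedicated unprivileged account first and use its name in master.cf.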

2.3 Start SpamAssassin

# Start the service
systemctl start spamassassin

# Restart Postfix to apply changes
systemctl restart postfix

# Check status
systemctl status spamassassin


Part 3: Initializing the Bayesian Database

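3.1 Create the Bayes Directory

sa-learn stores the database under the home directory of the user who runs it, so the commands below (run as root) create it in /root/.spamassassin. spamd scans mail as the user set in master.cf and reads that user's database, unless bayes_path in local.cf points both at the same files.

mkdir -p /root/.spamassassin
chmod 750 /root/.spamassassin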
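
If the account that scans mail is not root, one way to make training and scanning share a database is to set bayes_path and bayes_file_mode in local.cf. The sketch below only prints the two lines. The helper name bayes_shared_conf and the directory /var/lib/spamassassin/bayes are assumptions chosen for illustration.

```shell
#!/bin/bash
# bayes_shared_conf: print local.cf lines that point every user at one
# Bayes database stored under the directory given as $1.
bayes_shared_conf() {
    local dir=$1
    printf 'bayes_path %s/bayes\n' "$dir"
    printf 'bayes_file_mode 0770\n'
}

# Example (hypothetical directory): show the lines that would be added.
bayes_shared_conf /var/lib/spamassassin/bayes
```

To use it, create the directory, make it writable by the account that runs the filter, append the printed lines to /etc/mail/spamassassin/local.cf, and restart the service. The rest of this guide keeps the default per-user location.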

3.2 Initialize with a Test Message

echo "This is a normal test email from my server. It contains regular text that should be considered ham." | sa-learn --ham

3.3 Verify Database Creation

# Check database files
ls -la /root/.spamassassin/

# View database statistics
sa-learn --dump magic

Expected output (abridged; the token count will vary):

0.000          0          3          0  non-token data: bayes db version
0.000          0          0          0  non-token data: nspam
0.000          0          1          0  non-token data: nham
0.000          0         21          0  non-token data: ntokens
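
A single learned message is not enough for scoring: by default SpamAssassin ignores Bayes results until at least 200 spam and 200 ham messages have been learned. The sketch below reads the dump format shown above and reports whether that minimum has been reached. bayes_ready is a hypothetical helper, and the sample input is copied from the output above.

```shell
#!/bin/bash
# bayes_ready: read "sa-learn --dump magic" output on stdin and succeed only
# if both nspam and nham (third column) have reached the minimum in $1.
bayes_ready() {
    awk -v min="${1:-200}" '
        /non-token data: nspam/ { spam = $3 }
        /non-token data: nham/  { ham  = $3 }
        END { exit !(spam + 0 >= min && ham + 0 >= min) }'
}

# Sample input taken from the freshly initialized database (1 ham, 0 spam).
sample='0.000          0          0          0  non-token data: nspam
0.000          0          1          0  non-token data: nham'

if printf '%s\n' "$sample" | bayes_ready 200; then
    echo "Bayes is active"
else
    echo "Bayes needs more training"
fi
```

On a live server the same check would be run as: sa-learn --dump magic | bayes_ready 200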

Part 4: Training the Filter

4.1 Create Training Script

Create a script to automate training:

nano /usr/local/bin/sa-learn-weekly.sh

Paste this content (adjust paths for your mail setup):

#!/bin/bash
# Bayesian filter training script

LOG_FILE="/var/log/sa-learn.log"
DATE=$(date "+%Y-%m-%d %H:%M:%S")

echo "[$DATE] Starting Bayes training..." >> "$LOG_FILE"

# Train spam from all users' Spam folders
if [ -d "/var/qmail/mailnames" ]; then
    # Plesk/Qmail structure
    find /var/qmail/mailnames -path "*/.Spam/cur" -type d 2>/dev/null | while IFS= read -r spamdir; do
        count=$(find "$spamdir" -type f | wc -l)
        echo "  Training spam from $spamdir ($count messages)" >> "$LOG_FILE"
        # sa-learn prints a one-line summary; keep it (and any errors) in the log
        sa-learn --spam "$spamdir" >> "$LOG_FILE" 2>&1
    done

    # Train ham from inboxes (excluding Spam folders)
    find /var/qmail/mailnames -path "*/cur" -type d 2>/dev/null | grep -v '/\.Spam/' | while IFS= read -r inbox; do
        count=$(find "$inbox" -type f | wc -l)
        echo "  Training ham from $inbox ($count messages)" >> "$LOG_FILE"
        sa-learn --ham "$inbox" >> "$LOG_FILE" 2>&1
    done
elif [ -d "/var/mail" ]; then
    # Standard mail directory structure (one Maildir per user)
    for userdir in /var/mail/*; do
        if [ -d "$userdir/.Spam" ]; then
            sa-learn --spam "$userdir/.Spam" >> "$LOG_FILE" 2>&1
        fi
        if [ -d "$userdir/cur" ]; then
            sa-learn --ham "$userdir/cur" >> "$LOG_FILE" 2>&1
        fi
    done
fi

# Log results
echo "  Training complete. Database stats:" >> "$LOG_FILE"
sa-learn --dump magic >> "$LOG_FILE"
echo "" >> "$LOG_FILE"

Make it executable:

chmod +x /usr/local/bin/sa-learn-weekly.sh

4.2 Schedule Automatic Training

Add to crontab (runs every Sunday at 3 AM):

# Edit crontab
crontab -e

# Add this line:
0 3 * * 0 /usr/local/bin/sa-learn-weekly.sh
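
On a large mail store a training run can take a long time, and a manual run can overlap with the scheduled one. One simple guard is a lock directory, sketched below. run_locked and the lock path are hypothetical names chosen for illustration, and the demo runs a harmless echo instead of the training script.

```shell
#!/bin/bash
# run_locked: run a command only if no other instance holds the lock.
# mkdir is atomic, so only one caller can create the lock directory.
run_locked() {
    local lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another run is in progress, skipping" >&2
        return 1
    fi
    "$@"
    local status=$?
    rmdir "$lockdir"
    return "$status"
}

# Demo with a harmless command; the lock path is a hypothetical choice.
run_locked "${TMPDIR:-/tmp}/sa-learn-demo.lock" echo "training would run here"
```

In practice the last argument would be /usr/local/bin/sa-learn-weekly.sh. If a run is killed before it removes the lock directory, the directory has to be deleted by hand.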

4.3 Train Existing Mail (Optional)

If you have existing mail folders, train them immediately:

# For Plesk servers (reading find's output line by line keeps paths with spaces intact)
find /var/qmail/mailnames -path "*/.Spam/cur" -type d 2>/dev/null | while IFS= read -r spamdir; do
    echo "Training spam from: $spamdir"
    sa-learn --spam "$spamdir" --progress
done

find /var/qmail/mailnames -path "*/cur" -type d 2>/dev/null | grep -v '/\.Spam/' | while IFS= read -r inbox; do
    echo "Training ham from: $inbox"
    sa-learn --ham "$inbox" --progress
done
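
Before training a large mail store, it can help to preview how many messages each folder would contribute, for instance to spot a Trash or Sent folder that should not be learned as ham. The sketch below only counts files and never calls sa-learn. count_maildir_messages is a hypothetical helper, and the demo builds a throwaway Maildir tree.

```shell
#!/bin/bash
# count_maildir_messages: print "<count> <dir>" for every Maildir "cur"
# directory below the root given as $1, without training anything.
count_maildir_messages() {
    local root=$1
    find "$root" -type d -name cur 2>/dev/null | while IFS= read -r dir; do
        n=$(find "$dir" -maxdepth 1 -type f | wc -l)
        printf '%d %s\n' "$n" "$dir"
    done
}

# Self-contained demo on a throwaway Maildir tree.
demo=$(mktemp -d)
mkdir -p "$demo/alice/cur" "$demo/alice/.Spam/cur"
touch "$demo/alice/cur/msg1" "$demo/alice/cur/msg2" "$demo/alice/.Spam/cur/spam1"
count_maildir_messages "$demo" | sort
rm -rf "$demo"
```

On a Plesk server the call would be: count_maildir_messages /var/qmail/mailnames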


Part 5: Advanced Configuration

5.1 Adjust Spam Sensitivity

Edit /etc/mail/spamassassin/local.cf and pick one required_score value:

# More aggressive (lower score)
required_score 3.5

# Less aggressive (higher score); use this instead of the line above, not in addition
required_score 7.5

# Bayesian thresholds
bayes_auto_learn_threshold_spam 6.0
bayes_auto_learn_threshold_nonspam 0.1
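
The two auto-learn thresholds split incoming mail into three bands: confident spam, confident ham, and a middle band that is not learned. The sketch below shows that decision in simplified form. autolearn_decision is a hypothetical helper, and SpamAssassin's real auto-learner applies additional conditions that are omitted here, so the exact boundary behaviour is an assumption.

```shell
#!/bin/bash
# autolearn_decision: given a message score plus the spam and non-spam
# auto-learn thresholds, print what a simplified auto-learner would do.
autolearn_decision() {
    awk -v score="$1" -v spam_t="$2" -v ham_t="$3" 'BEGIN {
        if (score + 0 >= spam_t + 0)     print "learn as spam"
        else if (score + 0 < ham_t + 0)  print "learn as ham"
        else                             print "do not learn"
    }'
}

autolearn_decision 8.2 6.0 0.1   # prints: learn as spam
autolearn_decision 3.0 6.0 0.1   # prints: do not learn
autolearn_decision 0.0 6.0 0.1   # prints: learn as ham
```

Moving the thresholds closer together makes the filter learn from more mail, but it also raises the chance of learning from a misclassified message.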


5.2 Add Custom Blocklists

# Add to local.cf
blacklist_from *@*.xyz
blacklist_from *@*.top
blacklist_from *@*.bid
blacklist_from *@*.work
blacklist_from *@*.date
blacklist_from *@*.win

5.3 Create User-Specific Whitelists

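# For a specific user (read only when spamd scans mail as that user, e.g. via spamc -u user)
mkdir -p /home/user/.spamassassin
echo "whitelist_from trusted@domain.com" >> /home/user/.spamassassin/user_prefs
chown -R user: /home/user/.spamassassin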

Part 6: Monitoring and Maintenance

6.1 Check Bayesian Database Status

# View current statistics
sa-learn --dump magic

Example output (abridged) from a trained database:

0.000          0       3103          0  non-token data: nham
0.000          0        456          0  non-token data: nspam
0.000          0     207642          0  non-token data: ntokens

6.2 Monitor Spam in Real-Time

# Watch mail logs (on the Red Hat family the file is /var/log/maillog)
tail -f /var/log/mail.log | grep -E "spamd: result"

# Check spam scores
grep "spamd: result" /var/log/mail.log | tail -20
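
For a quick ratio rather than a stream of lines, the result entries can be tallied. The sketch below assumes the usual spamd log layout, in which "result:" is followed by Y for spam or a dot for ham. The two sample lines are illustrative rather than real log output, and summarize_results is a hypothetical helper.

```shell
#!/bin/bash
# summarize_results: count spam ("Y") and ham (".") verdicts in spamd
# "result:" log lines read from stdin.
summarize_results() {
    awk '
        match($0, /spamd: result: [Y.] /) {
            # The verdict character follows the 15-character prefix.
            verdict = substr($0, RSTART + 15, 1)
            if (verdict == "Y") spam++; else ham++
        }
        END { printf "spam=%d ham=%d\n", spam, ham }'
}

# Illustrative sample lines (assumed format).
sample='Jan 10 03:00:01 mail spamd[123]: spamd: result: Y 15 - BAYES_99,HTML_MESSAGE scantime=1.2,size=2048,user=spamd
Jan 10 03:00:05 mail spamd[123]: spamd: result: . 0 - BAYES_00 scantime=0.8,size=1024,user=spamd'

printf '%s\n' "$sample" | summarize_results   # prints spam=1 ham=1
```

Against a real log: grep "spamd: result" /var/log/mail.log | summarize_results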

6.3 Database Maintenance

Periodically expire old tokens:

# Manually expire old data
sa-learn --force-expire

# Check after expiry
sa-learn --dump magic


6.4 Create Monitoring Script

nano /usr/local/bin/check-bayes.sh

Paste this content:

#!/bin/bash
# Check Bayes health

STATS=$(sa-learn --dump magic)
# The message and token counts are in the third column of the dump
SPAM=$(echo "$STATS" | awk '/non-token data: nspam/ {print $3}')
HAM=$(echo "$STATS" | awk '/non-token data: nham/ {print $3}')
TOKENS=$(echo "$STATS" | awk '/non-token data: ntokens/ {print $3}')

echo "Bayes Status:"
echo "  Ham: ${HAM:-0} messages"
echo "  Spam: ${SPAM:-0} messages"
echo "  Tokens: ${TOKENS:-0}"

# Default to 0 so the comparison still works if the dump was empty
if [ "${SPAM:-0}" -lt 200 ]; then
    echo "⚠️  Warning: Only ${SPAM:-0} spam messages learned (need 200+)"
fi

if [ "${HAM:-0}" -lt 200 ]; then
    echo "⚠️  Warning: Only ${HAM:-0} ham messages learned (need 200+)"
fi

Make it executable:

chmod +x /usr/local/bin/check-bayes.sh

Part 7: Testing Your Setup

7.1 Test Spam Detection

# Create a test spam message
echo "XJS*C4JDBQADN1.NSBN3*2IDNEN*GTUBE-STANDARD-ANTI-UBE-TEST-EMAIL*C.34X" | spamc

The X-Spam-Status header in the output should show a score of about 1000, because the GTUBE test rule alone is worth 1000 points.

7.2 Test Ham Detection

# Create a test ham message
echo "Dear colleague, please find the quarterly report attached." | spamc

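This should score far below the GTUBE test. A bare line without headers can still pick up several points from missing-header rules, so a message with full headers gives a more realistic result.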
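
A more realistic ham test wraps the body in a minimal set of headers. The sketch below builds such a message. make_test_message is a hypothetical helper, and the addresses and subject are placeholders.

```shell
#!/bin/bash
# make_test_message: print a minimal RFC 5322 message (headers, blank line,
# body) so that missing-header rules do not distort the test score.
make_test_message() {
    local body=$1
    printf 'From: sender@example.com\n'
    printf 'To: recipient@example.com\n'
    printf 'Subject: Quarterly report\n'
    printf 'Date: %s\n' "$(LC_ALL=C date '+%a, %d %b %Y %H:%M:%S %z')"
    printf 'Message-ID: <test.%s@example.com>\n' "$$"
    printf '\n%s\n' "$body"
}

make_test_message "Dear colleague, please find the quarterly report attached."
```

Piping the result into spamc -R prints only the score report instead of the whole rewritten message.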

Troubleshooting: Common Issues and Solutions

Issue: Bayes database not created

# Create manually
mkdir -p /root/.spamassassin
sa-learn --sync

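Issue: Permission denied

# The database must be owned by the user that reads and writes it.
# In this guide sa-learn runs as root; an unprivileged account such as spamd
# cannot reach files below /root, whatever their owner.
chown -R root:root /root/.spamassassin
chmod 750 /root/.spamassassin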

Issue: SpamAssassin not starting

# Check logs
journalctl -u spamassassin -f

# Test configuration
spamassassin --lint

Issue: No spam being detected

# Check required score
grep required_score /etc/mail/spamassassin/local.cf

# Lower it if too high
sed -i 's/^required_score .*/required_score 3.5/' /etc/mail/spamassassin/local.cf
systemctl restart spamassassin
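
A plain grep also shows commented-out lines and repeated settings, which can hide the value that is actually in force. The sketch below prints only the last uncommented occurrence, on the assumption that a later line overrides an earlier one. effective_setting is a hypothetical helper, and the demo works on a throwaway file.

```shell
#!/bin/bash
# effective_setting: print the value of the last uncommented occurrence of
# a setting ($2) in a local.cf-style file ($1).
effective_setting() {
    awk -v key="$2" '$1 == key { value = $2 } END { if (value != "") print value }' "$1"
}

# Demo on a throwaway file with one commented and two active lines.
cf=$(mktemp)
printf '%s\n' '# required_score 9.0' 'required_score 5.0' 'required_score 3.5' > "$cf"
effective_setting "$cf" required_score   # prints 3.5
rm -f "$cf"
```

On a server the call would be: effective_setting /etc/mail/spamassassin/local.cf required_score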


Best Practices Summary

Practice          Recommendation
Initial training  Start with at least 200 ham and 200 spam
Ongoing training  Weekly automatic training
Sensitivity       Start with 5.0, adjust based on results
Monitoring        Check stats monthly
Backup            Back up /root/.spamassassin regularly
Updates           Keep SpamAssassin updated
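
The backup recommendation can be sketched as a small function that archives the database directory into a dated tarball. backup_dir and the archive naming are hypothetical choices, and the demo runs on throwaway directories rather than the real database.

```shell
#!/bin/bash
# backup_dir: archive directory $1 into a dated tarball under $2 and print
# the archive path.
backup_dir() {
    local src=$1 dest=$2 archive
    archive="$dest/bayes-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$dest" &&
        tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
        echo "$archive"
}

# Demo on throwaway directories; on a real server the source would be
# /root/.spamassassin and the destination a backup location of your choice.
src=$(mktemp -d)
dest=$(mktemp -d)
echo "placeholder" > "$src/bayes_toks"
backup_dir "$src" "$dest"
rm -rf "$src" "$dest"
```

A file-level copy taken while training is running may be inconsistent. As an alternative, sa-learn --backup > file writes a plain-text dump that sa-learn --restore file can load again.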

Conclusion

Bayesian filtering is one of the most effective ways to combat spam. Unlike static rules, it adapts to your specific email patterns and improves over time. With this setup, your server will:

✅ Learn from confidently classified mail (auto-learn) and from your weekly training runs
✅ Improve accuracy over time
✅ Reduce false positives
✅ Catch more spam with fewer resources
✅ Require minimal maintenance

The initial training period takes a few weeks, but once your database reaches 1,000+ spam and ham messages, you’ll see excellent results. Combined with DNS blocklists and regular updates, this provides enterprise-grade spam protection for your servers.


Remember: The key to effective Bayesian filtering is consistent training. Set up the cron job, let it run, and watch your spam detection improve month after month.