Automated Bayesian Spam/Ham Training with Rspamd

Published March 11th, 2020 by Philip Iezzi

# rspamd # dovecot # python # email # sysadmin

At Onlime GmbH, we migrated our mail infrastructure from good old SpamAssassin to Rspamd in December 2019, which greatly improved our spam filtering. Rspamd's statistics module supports Bayesian learning: you can feed emails through rspamc learn_spam or rspamc learn_ham for manual spam/ham training to improve the Bayes hit rate.

Until now, we have used Bayesian training only internally. Wouldn't it be nice to let all customers help us improve the Bayes hit rate, without even asking them to do so?

In this article, I am going to explain how to set up automated spam/ham Bayes learning for your mail infrastructure, built from the following components:

Dovecot IMAP server with Pigeonhole: IMAPSieve Plugin
A mail account spam@example.com for spam/ham learning
Global Sieve scripts on IMAP server that feed spam/ham into that mail account
Rspamd with the statistics module (Bayes) enabled
onlime/rspamd-trainer doing the actual spam/ham learning with rspamc learn_{spam|ham}

Further down, I explain the "magic": how our customers help improve our Bayes filtering without even noticing.

Set up rspamd-trainer

I have written a small helper script in Python that grabs messages from a mailbox via IMAP and feeds them to Rspamd for spam/ham learning:

onlime/rspamd-trainer

First, create a mail account for spam learning (this article uses spam@example.com as an example) with at least the following folders:

INBOX
├── report_ham
├── report_spam
└── report_spam_reply

rspamd-trainer creates additional INBOX/learned_* folders on demand, the first time it moves an email there. It grabs emails from the report_* folders and moves them to the matching learned_* folders once they have been processed successfully.

Now, install onlime/rspamd-trainer on the mail server where Rspamd is running (it does not need to run on your IMAP server):
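To make the report_* to learned_* flow a bit more tangible, here is a rough Python sketch of the idea. It is not the actual rspamd-trainer code: the names LEARN_ACTIONS, learned_folder, learn_command and learn are made up for illustration, and treating report_spam_reply as spam is an assumption of this sketch. The only things taken from the article are the folder names and the fact that rspamc learn_spam / learn_ham can read a message from stdin.

```python
import subprocess

# Folder -> rspamc subcommand. Treating report_spam_reply as spam is an
# assumption of this sketch, not something the article spells out.
LEARN_ACTIONS = {
    "report_spam": "learn_spam",
    "report_ham": "learn_ham",
    "report_spam_reply": "learn_spam",
}


def learned_folder(report_folder: str) -> str:
    """Map a report folder to its learned counterpart, e.g. report_spam -> learned_spam."""
    if not report_folder.startswith("report_"):
        raise ValueError(f"not a report folder: {report_folder}")
    return "learned_" + report_folder[len("report_"):]


def learn_command(report_folder: str) -> list:
    """Build the rspamc command line for messages found in the given report folder."""
    return ["rspamc", LEARN_ACTIONS[report_folder]]


def learn(message: bytes, report_folder: str, run=subprocess.run) -> bool:
    """Pipe one raw message to rspamc on stdin; True if rspamc exited with status 0."""
    result = run(learn_command(report_folder), input=message, capture_output=True)
    return result.returncode == 0
```

The run parameter only exists so the sketch can be tried without Rspamd installed; in a real script you would call learn() for each message fetched over IMAP and move the message to learned_folder(...) only when it returns True.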

$ cd /opt
$ git clone git@gitlab.com:onlime/rspamd-trainer.git
$ cd rspamd-trainer
$ python3 -m venv venv
$ . venv/bin/activate
(venv) $ pip install -r requirements.txt

Configuration is stored in .env. See .env.defaults for the default config options. Put the credentials for the spam@example.com mail account into that file:

.env

HOST=localhost
USERNAME=spam@example.com
PASSWORD=xxxxxxxxxxxxxxxx
INBOXPREFIX=INBOX/
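If you are curious how such a KEY=VALUE file turns into settings, the following minimal Python sketch parses it by hand. This is only an illustration: parse_env and folder_path are hypothetical helpers, and rspamd-trainer itself may well load its configuration differently.

```python
def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def folder_path(settings: dict, name: str) -> str:
    """Prefix a folder name with INBOXPREFIX, e.g. report_spam -> INBOX/report_spam."""
    return settings.get("INBOXPREFIX", "") + name
```

With the sample .env above, folder_path(settings, "report_spam") yields INBOX/report_spam, which is the folder layout used throughout this article.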

rspamd-trainer is now ready for work, and you can set up a cron job that runs it e.g. every 5 minutes. The job changes into the project directory first, since cron does not start there:

*/5 * * * * cd /opt/rspamd-trainer && venv/bin/python run.py

Now, copy a spam mail that has not been classified as spam by Rspamd into INBOX/report_spam on spam@example.com and monitor rspamd-trainer's log:

$ tail -f log/application.log

Sample log lines:

2020-03-11 11:55:01,872 INFO (4e8728ad) - INBOX/report_spam:40484 From:"Badcompany GmbH" <spammer@example.com> To:<contact@example.com> Message-ID: <ff093c4a...>
2020-03-11 11:55:01,873 INFO (4e8728ad) - running rspamc learn_spam ...
2020-03-11 11:55:01,965 INFO (4e8728ad) - rspamc output:
Results for file: stdin (0.035 seconds)
success = true;
filename = "stdin";
scan_time = 0.035999;

Great, rspamd-trainer does its job!
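If you want to check such output in a script of your own rather than by eye, a small Python helper along these lines would do. It is a sketch that only assumes the "success = true;" line shown in the sample log above; rspamc_succeeded is a made-up name.

```python
import re


def rspamc_succeeded(output: str) -> bool:
    """Return True if the rspamc output contains a 'success = true;' line."""
    pattern = r"^\s*success\s*=\s*true;\s*$"
    return re.search(pattern, output, re.MULTILINE) is not None


# The rspamc output from the sample log lines above
sample = """Results for file: stdin (0.035 seconds)
success = true;
filename = "stdin";
scan_time = 0.035999;
"""

print(rspamc_succeeded(sample))  # prints True
```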

Automated Bayes Learning

So far, we have not depended on any specific IMAP server implementation, as rspamd-trainer simply connects to our spam mailbox over IMAP for Bayes learning. From here on it gets more vendor-specific, and I am only going to present a solution for Dovecot, as Dovecot simply is the best open source IMAP server out there. Glad we finally ditched Cyrus IMAPd in 2019!

For automated spam/ham learning via Dovecot/IMAPSieve, first study the following tutorial for a quick overview:

Rspamd: Getting feedback from users with IMAPSieve

The main idea: Whenever a customer (mail account user) moves an email into their Spam folder, we assume this email was not detected as spam and should be learned as "spam". Whenever they move an email from their Spam folder into any other folder (other than Trash), we assume this was a false positive and should be learned as "ham".

The implementation below uses global Sieve scripts to copy such emails to our spam learning mailbox (the spam@example.com mail account we set up above):

Copy an email to the report_spam mailbox if a user copies it from elsewhere to their Spam folder or if a flag is changed on an email in the Spam folder.
Copy an email to the report_ham mailbox if a user copies it from their Spam folder to elsewhere.
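Before diving into the Dovecot configuration, the decision rule itself can be written down in a few lines of Python. This is purely an illustration of the logic, not something that runs on the mail server; classify_move is a hypothetical function, and the ignored folder names are the Trash-like entries from the exclusion list in learn-ham.sieve.

```python
from typing import Optional

SPAM_FOLDER = "INBOX/Spam"
# Trash-like targets that must not trigger ham learning
IGNORED_TARGETS = {"INBOX/Trash", "INBOX/Deleted Items", "INBOX/Bin"}


def classify_move(source: Optional[str], target: str) -> Optional[str]:
    """Return 'spam', 'ham' or None for a message arriving in target from source.

    source is None when the message was uploaded (IMAP APPEND) rather than copied.
    """
    if target == SPAM_FOLDER:
        # Anything arriving in the Spam folder from elsewhere is a missed spam
        return "spam" if source != SPAM_FOLDER else None
    if source == SPAM_FOLDER and target not in IGNORED_TARGETS:
        # Rescued from the Spam folder: a false positive
        return "ham"
    return None
```

For example, classify_move("INBOX", "INBOX/Spam") gives "spam", classify_move("INBOX/Spam", "INBOX") gives "ham", and moving a message from Spam to Trash gives None.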

Spam/Ham learning is triggered via Dovecot/IMAPSieve configuration in conf.d/90-sieve.conf:

conf.d/90-sieve.conf

plugin {
  # ...

  ###
  ### Spam learning with IMAPSieve
  ### Note: MUAs may move messages with the IMAP COPY command or with APPEND (MS Outlook).
  ###
  # Spam: From elsewhere to Spam folder or flag changed in Spam folder
  imapsieve_mailbox1_name = INBOX/Spam
  imapsieve_mailbox1_causes = COPY APPEND FLAG
  imapsieve_mailbox1_before = file:/var/lib/dovecot/sieve/learn-spam.sieve

  # Ham: From Spam folder to elsewhere
  imapsieve_mailbox2_name = *
  imapsieve_mailbox2_from = INBOX/Spam
  imapsieve_mailbox2_causes = COPY
  imapsieve_mailbox2_before = file:/var/lib/dovecot/sieve/learn-ham.sieve

  # ...
}

The global learn-spam.sieve script handles spam learning and writes log lines containing the learn-spam keyword:

learn-spam.sieve

require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "imap4flags", "vnd.dovecot.debug", "variables"];

# Logging
if address :matches "from" "*" { set "FROM" "${1}"; }
if address :matches "to" "*" { set "TO" "${1}"; }
if header :matches "subject" "*" { set "SUBJECT" "${1}"; }
if header :matches "Message-ID" "*" { set "MSGID" "${1}"; }
if header :matches "X-Spamd-Result" "*" { set "XSpamdResult" "${1}"; }
if environment :matches "imap.cause" "*" { set "IMAPCAUSE" "${1}"; }
debug_log "learn-spam.sieve was triggered on imap.cause=${IMAPCAUSE}: msgid=${MSGID}";
set "LogMsg" "learn-spam on imap.cause=${IMAPCAUSE}: from=${FROM}, to=${TO}, subject=${SUBJECT}, msgid=${MSGID}, X-Spamd-Result=${XSpamdResult}";

# Spam-learning by storing a copy of the message into spam@example.com
if anyof (environment :is "imap.cause" "COPY", environment :is "imap.cause" "APPEND") {
    debug_log "${LogMsg}";
    debug_log "learn-spam copy to INBOX/report_spam";
    pipe :copy "dovecot-lda" [ "-d", "spam@example.com", "-m", "INBOX/report_spam" ];
}
# Catch replied or forwarded spam
elsif anyof (allof (hasflag "\\Answered", environment :contains "imap.changedflags" "\\Answered"),
             allof (hasflag "$Forwarded", environment :contains "imap.changedflags" "$Forwarded")) {
    debug_log "${LogMsg}";
    debug_log "learn-spam copy to INBOX/report_spam_reply";
    pipe :copy "dovecot-lda" [ "-d", "spam@example.com", "-m", "INBOX/report_spam_reply" ];
}
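The branching at the end of learn-spam.sieve is easy to misread, so here is the same decision mirrored in Python. It is a simplified sketch: spam_report_folder is a hypothetical function, and it models the flags as sets, whereas the Sieve script does a substring match (:contains) on imap.changedflags.

```python
from typing import Optional, Set


def spam_report_folder(cause: str, flags: Set[str], changed_flags: Set[str]) -> Optional[str]:
    """Pick the report folder for a message that triggered learn-spam.sieve."""
    if cause in ("COPY", "APPEND"):
        # Message was moved or uploaded into the Spam folder
        return "INBOX/report_spam"
    for flag in ("\\Answered", "$Forwarded"):
        # Only react if the flag is set AND was changed by this very action
        if flag in flags and flag in changed_flags:
            return "INBOX/report_spam_reply"
    # Any other flag change (e.g. marking as read) is ignored
    return None
```

So a plain \Seen flag change in the Spam folder does nothing, while replying to or forwarding a message that sits in Spam sends a copy to INBOX/report_spam_reply.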

The global learn-ham.sieve script handles ham learning and writes log lines containing the learn-ham keyword:

learn-ham.sieve

require ["vnd.dovecot.pipe", "copy", "imapsieve", "environment", "variables", "vnd.dovecot.debug"];

# Exclude messages which were moved to Trash (or training mailboxes) from ham learning
if environment :matches "imap.mailbox" "*" {
    set "mailbox" "${1}";
}
if string "${mailbox}" [ "INBOX/Trash", "INBOX/Deleted Items", "INBOX/Bin", "INBOX/train_ham", "INBOX/train_prob", "INBOX/train_spam" ] {
    stop;
}

# Logging
if address :matches "from" "*" { set "FROM" "${1}"; }
if address :matches "to" "*" { set "TO" "${1}"; }
if header :matches "subject" "*" { set "SUBJECT" "${1}"; }
if header :matches "Message-ID" "*" { set "MSGID" "${1}"; }
if header :matches "X-Spamd-Result" "*" { set "XSpamdResult" "${1}"; }
if environment :matches "imap.cause" "*" { set "IMAPCAUSE" "${1}"; }
debug_log "learn-ham on imap.cause=${IMAPCAUSE}: from=${FROM}, to=${TO}, subject=${SUBJECT}, msgid=${MSGID}, X-Spamd-Result=${XSpamdResult}";

# Ham-learning by storing a copy of the message into spam@example.com
debug_log "learn-ham copy to INBOX/report_ham";
pipe :copy "dovecot-lda" [ "-d", "spam@example.com", "-m", "INBOX/report_ham" ];

Prepare Dovecot and compile global Sieve scripts:

$ ln -s /usr/lib/dovecot/dovecot-lda /usr/local/sbin/dovecot-lda
$ sievec /var/lib/dovecot/sieve/learn-spam.sieve
$ sievec /var/lib/dovecot/sieve/learn-ham.sieve

Once this is all set up, monitor mail.log:

$ tail -f /var/log/mail.log | grep -E 'sieve.*learn'
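If you would rather get a quick tally than watch the log scroll by, a few lines of Python can count the copies per report folder. This is a sketch under assumptions: it relies only on the debug_log messages from the Sieve scripts above ("learn-spam copy to ..." and "learn-ham copy to ...") and on the word "sieve" appearing earlier on the same line, as the grep pattern does. The exact log line prefix depends on your syslog setup, and count_reports is a made-up name.

```python
import re
from collections import Counter

# Matches the "learn-... copy to <folder>" messages written by the Sieve scripts
LEARN_RE = re.compile(r"sieve.*learn-(spam|ham) copy to (\S+)")


def count_reports(lines) -> Counter:
    """Count how many messages were copied to each report folder."""
    counts = Counter()
    for line in lines:
        match = LEARN_RE.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts
```

You could call it as count_reports(open("/var/log/mail.log")) and print the resulting Counter.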

Emails are now copied from all mail accounts to the spam/ham learning mailbox on spam@example.com, and rspamd-trainer feeds them to Rspamd for Bayesian learning. If you ever want to reverse learning, simply move the (wrongly learned) emails from any learned_* folder to the appropriate report_* folder.

Maintenance

Make sure you set the quota for the global mail account spam@example.com high enough, so that you never run out of storage space. Also make sure you protect that account with a strong password and never hand those credentials out to anybody other than the sysadmin of your mail infrastructure (geeky you, as you probably wouldn't have read this article if it was somebody else), as this mail account may contain sensitive emails from your friends/customers.

And, once in a while, do some cleanup and e.g. remove any learned messages that are older than 90 days, using doveadm on your Dovecot mailserver:

# check number of messages and mailbox size of INBOX/learned_spam
$ doveadm mailbox status -u spam@example.com 'messages vsize' INBOX/learned_spam
# remove any emails older than 90d from INBOX/learned_spam
$ doveadm expunge -u spam@example.com mailbox INBOX/learned_spam BEFORE 90d > /dev/null 2>&1

# check number of messages and mailbox size of INBOX/learned_ham
$ doveadm mailbox status -u spam@example.com 'messages vsize' INBOX/learned_ham
# remove any emails older than 90d from INBOX/learned_ham
$ doveadm expunge -u spam@example.com mailbox INBOX/learned_ham BEFORE 90d > /dev/null 2>&1

You might automate this cleanup task with a script that you run from a cron job or a systemd timer.
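Such a cleanup script could look roughly like the following Python sketch. It simply wraps the doveadm expunge commands shown above; expunge_command and cleanup are hypothetical names, and the folder list and the 90 day retention are taken from the example, so adjust them to your needs.

```python
import subprocess

USER = "spam@example.com"
FOLDERS = ["INBOX/learned_spam", "INBOX/learned_ham"]


def expunge_command(folder: str, max_age: str = "90d", user: str = USER) -> list:
    """Build the doveadm command that removes messages older than max_age."""
    return ["doveadm", "expunge", "-u", user, "mailbox", folder, "BEFORE", max_age]


def cleanup(folders=FOLDERS, run=subprocess.run) -> None:
    """Expunge old messages from every learned_* folder, stopping on the first error."""
    for folder in folders:
        run(expunge_command(folder), check=True)
```

Calling cleanup() at the end of the script, on the Dovecot server and as a user allowed to run doveadm, does the actual work. The run parameter is only there so the command building can be tried without touching a real mailbox.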

Happy learning and never stop fighting spam!

Author: Philip Iezzi (Pipo)

Owner of Onlime GmbH - providing quality webhosting with love. All into system engineering, Linux sysadmin, security, full stack web development, mountain biking, slacklining, dancing & feeling connected to nature.
