
It seems that training bogofilter on its errors _only_ is a very good
way to train, at least with the Robinson-Fisher or Bayes chain rule
calculation methods.  The way this works is: messages from the training
corpus are picked at random (without replacement, i.e. no message is
used more than once) and fed to bogofilter for evaluation.  If
bogofilter gets the classification right, nothing further is done.  If
it gets it wrong, or returns uncertain when ternary mode is in use, the
message is fed to bogofilter again with the -s or -n option, as
appropriate.
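
For concreteness, here's a minimal sh sketch of that loop.  It's my
own illustration, not an excerpt from randomtrain: it assumes one
message per file, a single known class per run, and the conventional
exit codes (0 = spam, 1 = nonspam); the shuffling step is shown
further down.

    #!/bin/sh
    # Sketch: train-on-error over a directory of message files,
    # all known to belong to one class.
    #   $1 = bogofilter database directory
    #   $2 = directory containing one message per file
    #   $3 = correct class: "spam" or "nonspam"
    bogodir=$1; msgdir=$2; class=$3
    for msg in "$msgdir"/*; do
        bogofilter -d "$bogodir" < "$msg"
        rc=$?
        if [ "$class" = spam ] && [ "$rc" -ne 0 ]; then
            bogofilter -d "$bogodir" -s < "$msg"   # missed spam: register it
        elif [ "$class" = nonspam ] && [ "$rc" -ne 1 ]; then
            bogofilter -d "$bogodir" -n < "$msg"   # false positive: register it
        fi
    done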

That's all very well, except that it's not an easy process to execute
with just a couple of shell commands.  I've now written a bash script
that does the job [Matthias Andree: this has been changed so it may now
work on a regular POSIX-compliant sh, too; feedback is welcome]; you
give it the directory in which to build the bogofilter database, and a
list of files each flagged with -s or -n to indicate spam or nonspam,
and it performs training-on-error using all the messages in all the
files, in random order.
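
By way of illustration, an invocation might look like the following
(the exact option spelling is whatever the script's usage message
says; the -d flag for the database directory and the mailbox names
here are my guesses):

    randomtrain -d ~/.bogofilter -s spam1.mbx -s spam2.mbx -n good.mbx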

My production version of bogofilter returns the following exit codes:

    0 for spam
    1 for nonspam
    2 for uncertain
    3 for error

Normal bogofilter returns (I think):

    0 for spam
    1 for nonspam
    2 for error
This script will work with either.
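
One way to stay agnostic about the two schemes (a guess at the logic,
not a quote from the script) is to treat anything other than a
confident 0 or 1 as grounds for training:

    # $want is "spam" or "nonspam"; $flag is -s or -n to match.
    bogofilter -d "$bogodir" < "$msg"
    case $? in
        0) got=spam ;;
        1) got=nonspam ;;
        *) got=retrain ;;    # uncertain or error: always train
    esac
    [ "$got" = "$want" ] || bogofilter -d "$bogodir" "$flag" < "$msg"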

You can use it to build from scratch: since no database exists yet, the
first message evaluated will return the error exit code, and
randomtrain (as this script is called) will train with that message,
thus creating the databases.

The script needs rather a lot of auxiliary commands (they're listed in
the comments at the top of the file); in particular, perl is called for
the randomization function.  (The embedded perl script is "useful" in
its own right: it takes text on standard input and returns the lines in
random sequence.)  Known portability issue: on HP-UX (10.20 at least),
grep -b returns a block offset instead of a byte offset, so randomtrain
won't work there unless GNU grep is substituted for the HP-UX one.
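
For what it's worth, a line shuffler of that kind can be had from a
Fisher-Yates shuffle in a few lines of perl; this is a generic sketch,
not the script actually embedded in randomtrain:

    perl -e '@l = <STDIN>;
             for ($i = $#l; $i > 0; $i--) {
                 $j = int rand($i + 1);
                 @l[$i, $j] = @l[$j, $i];    # swap lines i and j
             }
             print @l'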

I rebuilt my training lists with randomtrain.  The training corpus
consists of 9878 spams and 7896 nonspams.  The message counts reported
by bogoutil -w bogodir are 1475 and 408.  The database sizes from full
training were 10 and 4 MB; randomtrain produced .db files of 7 and 1.2
MB.  I don't yet have figures comparing bogofilter's discrimination
with these two training sets, but yesterday's smaller-scale test (which
motivated me to write this script) clearly indicated that an
improvement could be expected.

Greg Louis <glouis@dynamicro.on.ca>
