Deciding on a SpamAssassin Threshold - tuesday 2007-01-09 2301 last modified 2007-01-10 2124
Categories: Nerdy

SpamAssassin's cumulative scoring comes down to a simple comparison to decide whether mail is good or not: does the score reach the user's threshold? If so, it's spam; if not, it's good. Figuring out where that threshold belongs should be a statistically driven process, but there don't seem to be many tools out there to help steer the decision (I'd love to know if I'm wrong about that).

So I mined last year's mail for some numbers. No graphs this time, just some figures. My current threshold is set to 1.5; at least 83 pieces of good mail exceeded that score. The mode, median, and average of those false positives were 1.6, 2.3, and 2.9, respectively.
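For anyone wanting to pull the same kind of summary from their own mail, here is a minimal Python sketch. It assumes the ham scores have already been extracted into a list of floats, and that a message counts as spam when its score is greater than or equal to the threshold. The function name and the sample scores are made up for illustration; they are not my real data.

```python
import statistics


def false_positive_summary(ham_scores, threshold):
    """Summarize the ham scores that reach the threshold (false positives).

    Assumption: a score >= threshold means the message is flagged as spam.
    """
    flagged = [s for s in ham_scores if s >= threshold]
    if not flagged:
        return None
    return {
        "count": len(flagged),
        "mode": statistics.mode(flagged),
        "median": statistics.median(flagged),
        "mean": statistics.mean(flagged),
    }


# Hypothetical scores, for illustration only.
sample_ham = [-2.7, -2.6, 0.4, 1.6, 1.6, 2.3, 4.1]
summary = false_positive_summary(sample_ham, 1.5)
print(summary)
```

The mode is only meaningful here because SpamAssassin scores are rounded to one decimal place; with unrounded values it would make more sense to bucket the scores first.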

Almost all of the good mail is non-uniformly clustered around -2.7, with a similar, smaller cluster around 0.4 and a spike around -5.9.

Spam is highly clustered around 2.0, with more than a tenth of all spam scoring around 1.9. Mail scoring at least 1.5 but less than 1.9 makes up less than 1% of all spam processed. I think I could live with raising the threshold to 1.9: that means less than 1% more spam to mark manually and a 30% reduction in ham I need to rescue. I'd like to cut the ham rescue quota by more, but going any higher would let through the tenth of spam sitting around 1.9, and a 10% increase in spam seen is a little too high.
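The trade-off above can be sketched as a small comparison between two thresholds. This is only an illustration under stated assumptions: scores are plain lists of floats, a message is flagged when its score is greater than or equal to the threshold, and the function name and sample scores are hypothetical, chosen so the output mirrors the 30% and 1% figures rather than reproducing my actual mail.

```python
def threshold_tradeoff(ham_scores, spam_scores, current, candidate):
    """Compare a candidate threshold against the current one.

    Assumption: a score >= threshold means the message is flagged as spam.
    Returns the fractional drop in ham that must be rescued and the extra
    share of all spam that slips through unflagged.
    """
    def flagged(scores, threshold):
        return sum(1 for s in scores if s >= threshold)

    ham_before = flagged(ham_scores, current)
    ham_after = flagged(ham_scores, candidate)
    spam_before = flagged(spam_scores, current)
    spam_after = flagged(spam_scores, candidate)

    ham_reduction = (ham_before - ham_after) / ham_before if ham_before else 0.0
    extra_spam = (spam_before - spam_after) / len(spam_scores) if spam_scores else 0.0
    return {"ham_rescue_reduction": ham_reduction, "extra_spam_fraction": extra_spam}


# Hypothetical scores, for illustration only.
sample_ham = [-2.7, 0.4, 1.6, 1.7, 1.8, 2.3, 2.9, 3.0, 4.0, 5.0, 6.0, 7.0]
sample_spam = [1.7] + [1.9] * 10 + [2.0] * 89
result = threshold_tradeoff(sample_ham, sample_spam, current=1.5, candidate=1.9)
print(result)
```

Running the same function over a range of candidate thresholds would show where the extra spam starts to outweigh the ham saved.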

Additional data point for 2007's spam report: changed threshold score to 1.9 on 2007-01-09.
