Personal Spam Report 2006 - tuesday 2007-01-09 0846 last modified 2007-01-11 0001
Categories: Nerdy

Like last year's report, with more graphs! There are four classes of mail in play: spam filtered correctly by SpamAssassin, spam caught by Thunderbird, spam missed by both, and real mail incorrectly classified as spam by SpamAssassin (a.k.a. ham). It's a cascade, with me working at the end of the process. The first graph is a count of all of the classes per month in a stacked plot. This year's trend is upwards, the opposite of last year's general falling trend. This year's highest point is around the same as last year's. If next year continues in the same vein, that's going to suck.
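The cascade and the per-month tally can be sketched in a few lines of Python. This is only an illustration, not the shell scripts actually used for the report: the function names, the class labels, and the three boolean flags per message are all assumptions made for the example.

```python
from collections import Counter
from datetime import date


def classify(sa_flagged, tb_flagged, is_spam):
    """Return the class of one message in the cascade, or None for plain real mail.

    sa_flagged: SpamAssassin marked it as spam (hypothetical flag)
    tb_flagged: Thunderbird's junk filter marked it (hypothetical flag)
    is_spam:    what the message really was, as judged by hand
    """
    if sa_flagged:
        # SpamAssassin acts first; a flagged real message is the "ham" class.
        return "spamassassin" if is_spam else "ham"
    if not is_spam:
        return None  # real mail that passed through untouched is not counted
    # Spam that got past SpamAssassin: caught by Thunderbird, or marked by hand.
    return "thunderbird" if tb_flagged else "manual"


def monthly_counts(messages):
    """Count messages per (YYYY-MM, class) from (date, sa, tb, is_spam) tuples."""
    counts = Counter()
    for received, sa_flagged, tb_flagged, is_spam in messages:
        cls = classify(sa_flagged, tb_flagged, is_spam)
        if cls is not None:
            counts[(received.strftime("%Y-%m"), cls)] += 1
    return counts
```

Each month's four counts are what a stacked plot would draw on top of one another.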

Linear graph of spam per month, stacked by mail type

Here's the same data on a logarithmic, overlaid plot. In March through May and December, I marked more mail as junk than Thunderbird did automatically. April and May are somewhat odd in that only one badly classified mail was caught. That doesn't appear likely to happen ever again, and I do suspect I might have missed something in that timeframe. However, I suspect I missed more in September through December, given how many more were found then.

Log graph of spam per month, overlaid separately by mail type

Like the first graph, but by week, followed immediately by a stacked graph of SpamAssassin failures.

Linear graph of spam per week, stacked by mail type

Stacked area graph of SpamAssassin failures, per week

The latter graph shows the percentage of all mail per week that was not classified correctly by SpamAssassin. I was quite shocked at the percentages at the end of the graph - we upgraded SpamAssassin around that point, and the upgrade appears to have cut failures five-fold. It remains to be seen whether the volume of spam has anything to do with the performance change.
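The weekly failure percentage is simple arithmetic. Here is a minimal Python sketch, assuming the denominator is the total of the four tracked classes for that week (ordinary real mail is not counted) and that the per-week counts are already tallied. The function name, the class labels, and the dictionary layout are hypothetical.

```python
def failure_percentages(weekly):
    """Map each week to the percentage of tracked mail SpamAssassin got wrong.

    weekly: {week_label: {"spamassassin": n, "thunderbird": n, "manual": n, "ham": n}}
    Everything other than the "spamassassin" class counts as a failure.
    """
    result = {}
    for week, counts in weekly.items():
        total = sum(counts.values())
        failures = total - counts.get("spamassassin", 0)
        # Weeks with no tracked mail get 0.0 rather than dividing by zero.
        result[week] = 100.0 * failures / total if total else 0.0
    return result
```

A five-fold cut would show up as, say, a week at 10% followed by weeks near 2%.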

Finally, one of the more fun graphs - how much work I had to do per week.

Fill-type graph of SpamAssassin failures, per week

Thunderbird, manual, and ham represent failures in SpamAssassin. Thunderbird is automatic, but manual and ham are things I have to think about and pull out. The March through June period saw a general low in the absolute number of spams, but a marked increase in the amount of reclassifying I had to do. October through December were a struggle, with a significant amount of real mail being tossed away. What made SpamAssassin so much more effective was probably the introduction of block list scoring, which, all too frequently for ham, outweighs the Bayesian and auto-whitelist scores.
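SpamAssassin adds up the scores of every rule a message hits and flags the message once the total reaches the configured threshold (5.0 by default). The Python sketch below shows how a block list hit can outweigh negative Bayesian and auto-whitelist scores. The rule labels and score values are invented for illustration and are not taken from any real configuration.

```python
def is_flagged(rule_scores, threshold=5.0):
    """Return True when the summed rule scores reach the spam threshold."""
    return sum(rule_scores.values()) >= threshold


# Hypothetical real message: the Bayesian and auto-whitelist scores argue
# for ham, but the block list hit pushes the total over the threshold anyway.
ham_scores = {"block_list": 9.0, "bayes": -2.5, "auto_whitelist": -1.0}

flagged_at_default = is_flagged(ham_scores)            # total is 5.5
flagged_at_higher = is_flagged(ham_scores, threshold=6.0)
```

Raising the threshold rescues this message, at the price of letting through any spam that scores between the old and new values.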

Clearly I need to change my scoring threshold to avoid the ham mining. More on that later.

I made one serious change and decided to outright trust the spam classification of mail arriving through W3C's spam filters. The rest I pass through my own filters. I believe W3C's filtering has misclassified ham once. Lately it's been doing worse with missing spam, but the rarity of a ham misfire is nice.

Things that affect the graphs that I should track for next year: Thunderbird upgrades, SpamAssassin upgrades, changes to SpamAssassin configuration. I won't be comparing to my volume of real mail unless I find a better way to generate these graphs; I don't really want to keep trash around. Spam outnumbers real mail by around an order of magnitude, though.

You can look at a page of (large) graphs to see all of the figures. I didn't use RDF this year, only shell scripts for numbers and Excel for graphs. I may do the RDFization of SpamAssassin mail to check if I missed anything. If I get bored.
