Spam Statistics - friday 2005-12-09 0207 last modified 2005-12-12 1420
Categories: Nerdy
TrackBacks Sent: None

Edit: cleaner data with better graph

I'm getting a bit closer to being able to generate pretty graphs of my spam. My corpus is relatively small, only about 20,000 pieces to deal with from 2005, and, sadly, Thunderbird still doesn't have a good way to differentiate between spam I've marked manually and spam it's caught automatically. So there are really only two engines to compare: SpamAssassin and Thunderbird on whatever SpamAssassin doesn't catch (the missing third level in this scheme is me on whatever Thunderbird doesn't catch).

Here's one graph manually ripped over to Excel:

Histogram of spam by month

This is the graph I've been trying to get around to making for about a year, though it wouldn't have made a whole lot of sense before now, with about a year's worth of useful data behind it. It's not entirely correct yet. The top bar is SpamAssassin's catch; the bottom is anything it got wrong, including false positives and false negatives (false positives make up less than 0.1% of the data, an aggravating rate at which to go looking for badly marked spam, really). The bottom bar is mostly Thunderbird's catches, but see the discussion above on Thunderbird's shortcomings.
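For anyone wanting to reproduce the two bars without the RDF round trip, here is a minimal sketch of the tally in Python. It is not how the graph above was made. It assumes the spam is available as raw message strings, that SpamAssassin marked its catches with an `X-Spam-Flag: YES` header, and that every message in the input really is spam; the function name `monthly_tally` is made up for illustration.

```python
from collections import Counter
from email import message_from_string
from email.utils import parsedate_to_datetime


def monthly_tally(raw_messages):
    """Count spam per month, split by whether SpamAssassin flagged it.

    Assumption: every message passed in is spam, so a message without
    'X-Spam-Flag: YES' counts as one SpamAssassin missed.
    """
    caught = Counter()  # top bar: SpamAssassin's catch
    missed = Counter()  # bottom bar: everything it did not flag
    for raw in raw_messages:
        msg = message_from_string(raw)
        date_header = msg["Date"]
        month = "unknown"  # spammers often omit or mangle the Date header
        if date_header is not None:
            try:
                month = parsedate_to_datetime(date_header).strftime("%Y-%m")
            except (TypeError, ValueError):
                pass  # unparseable date, leave it in the "unknown" bucket
        flag = (msg.get("X-Spam-Flag") or "").strip().upper()
        if flag == "YES":
            caught[month] += 1
        else:
            missed[month] += 1
    return caught, missed
```

The two counters map directly onto the two bars; false positives would need a separate, hand-checked list, since nothing in the headers says SpamAssassin was wrong.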

I generated this using... RDF. With Stefano's RDFizing mail tool, I managed to take a look at the data in Longwell, which told me where I needed to make adjustments in the code or in the data so everything would work out correctly. It doesn't all fit yet; I don't much care about message IDs in spam, so that part may have to go since spammers certainly don't care about getting it right.

What happened in April (around the 8th or 9th) that started a decline of almost 2000 pieces per month? I haven't a clue. Did somebody get arrested?

I have further data on required scorings, scores, version, etc., that I'd love to graph. More on that data and graphing later.
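As a rough idea of where those numbers could come from: SpamAssassin commonly writes an `X-Spam-Status` header along the lines of `Yes, score=7.3 required=5.0 tests=... version=3.0.4` (older releases used `hits=` instead of `score=`). The sketch below pulls the fields out of one such header value. The header layout is an assumption about a typical setup, not something taken from my data, and `parse_spam_status` is a hypothetical helper name.

```python
import re

# Matches key=value pairs such as "score=7.3" or "required=5.0".
_PAIR = re.compile(r"(\w+)=(\S+)")


def parse_spam_status(header):
    """Extract flag, score, required threshold and version from a header value.

    Assumes the common 'Yes, score=... required=... version=...' layout;
    missing fields come back as None.
    """
    header = " ".join(header.split())  # undo header folding across lines
    fields = dict(_PAIR.findall(header))
    score = fields.get("score", fields.get("hits"))  # 'hits' in older releases
    required = fields.get("required")
    return {
        "flagged": header.lower().startswith("yes"),
        "score": float(score) if score is not None else None,
        "required": float(required) if required is not None else None,
        "version": fields.get("version"),
    }
```

Feeding each message's `X-Spam-Status` value through this gives one row per message, which is about the shape a spreadsheet or plotting tool wants.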


TrackBacks

Spam Sentencing

In answer to my own facetious question as to lessening spam and the goings-on of April 8th and 9th, there does in fact seem to be a criminal involved. The most relevant event appears to be Jeremy Jaynes' sentencing to nine years in prison on felony sp...

Ryan's Journal on December 10, 2005 10:41 AM [trackback]

Detailed Spam Breakdown

Unless I find something else fascinating buried in my data, this will be the last in my spate of posts on spam. The average file size of spam was very close to 10KB per mail. Altogether, the mail represented in these graphs comes to 19,660. The nu...

Ryan's Journal on December 12, 2005 09:57 AM [trackback]