|Spam Statistics - Thursday 2005-12-08 1807||last modified 2005-12-12 0620|
Edit: cleaner data with better graph
I'm getting a bit closer to being able to generate pretty graphs of my spam. My corpus is relatively small, only about 20,000 pieces from 2005, and, sadly, Thunderbird still doesn't have a good way to differentiate between spam I've marked manually and spam it's caught automatically. So there are really only two engines to compare: SpamAssassin, and Thunderbird on whatever SpamAssassin doesn't catch (the missing third level in this scheme is me, on whatever Thunderbird doesn't catch).
Here's one graph manually ripped over to Excel:
This is the graph I've been trying to get around to making for about a year, though it wouldn't have made much sense until now, when I finally have about a year's worth of useful data. It's not entirely correct yet. The top bar is SpamAssassin's catch; the bottom is everything it got wrong, including false positives and false negatives (false positives make up less than 0.1% of the data, an aggravating rate when you go looking for badly marked spam). The bottom bar is mostly Thunderbird's catches, but not entirely (see the discussion of Thunderbird's shortcomings above).
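For concreteness, the kind of tally behind a stacked graph like this can be sketched in a few lines of Python. The record layout, the layer names, and the `sa_rate` helper are all hypothetical illustrations, not the actual processing pipeline:

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (arrival date, which layer caught the spam).
# "sa" = SpamAssassin; "tb" = Thunderbird catching what SpamAssassin
# missed; "manual" = marked by hand. These names are illustrative,
# not the real corpus format.
catches = [
    (date(2005, 3, 14), "sa"),
    (date(2005, 3, 20), "tb"),
    (date(2005, 4, 2), "sa"),
    (date(2005, 4, 9), "sa"),
    (date(2005, 4, 30), "manual"),
]

# Bucket counts by month and by layer: the per-month totals for each
# layer are exactly the segments of the stacked bars.
monthly = defaultdict(lambda: defaultdict(int))
for d, layer in catches:
    monthly[(d.year, d.month)][layer] += 1

def sa_rate(year, month):
    """SpamAssassin's share of that month's spam."""
    counts = monthly[(year, month)]
    total = sum(counts.values())
    return counts["sa"] / total if total else 0.0
```

With a full year of records in `catches`, the twelve `monthly` buckets are the bar heights, and `sa_rate` gives the top-bar fraction per month.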
I generated this using... RDF. With Stefano's RDFizing mail tool, I managed to take a look at the data in Longwell, which told me where I needed to make adjustments in the code or in the data so everything would work out correctly. It doesn't all fit yet; I don't much care about message IDs in spam, so that part may have to go since spammers certainly don't care about getting it right.
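As a toy illustration of the shape of that adjustment (not the output of Stefano's actual tool): each message becomes a handful of subject-predicate-object triples, and dropping the message-ID property is just a filter on the predicate. The vocabulary URIs here are invented for the example:

```python
# One spam message as toy RDF-style triples. The predicate URIs are
# made up for illustration; they are not the vocabulary the real
# RDFizing tool emits.
msg = "urn:mail:0001"
triples = [
    (msg, "http://example.org/mail#date", "2005-04-09"),
    (msg, "http://example.org/mail#caughtBy", "SpamAssassin"),
    (msg, "http://example.org/mail#messageId", "<forged@spammer.invalid>"),
]

# Spammers forge message IDs anyway, so drop that property before
# loading the data into the faceted browser.
keep = [t for t in triples if not t[1].endswith("#messageId")]
```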
What happened in April (around the 8th or 9th) that started a decline of almost 2000 pieces per month? I haven't a clue. Did somebody get arrested?
I have further data on required scores, actual scores, versions, and so on that I'd love to graph. More on that data, and on graphing it, later.