Now Gathering Web Spam - Monday 2006-03-06 1626, last modified 2006-03-06 1626
Categories: Nerdy, ryanlee.org
TrackBacks Sent: None

Please ignore the 'trackback' section on each journal entry for a couple of weeks. I'm trying to build up a corpus of web spam for my classifier (really more like Paul Graham's classifier, now in Tcl). I made the mistake of trashing all the bad trackbacks as they came in, so I have nothing 'bad' of my own with which to train the system. Over the past two and a half years I kept 12 good trackbacks out of 19,275 attempts, and 9 of those 12 are from me, though for the vast majority of that time I had the trackback system set to simply reject all pings. I'm assuming the noise-to-signal ratio is going to be incredibly high again.
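For anyone unfamiliar with the approach, here is a rough sketch of the kind of filter Paul Graham describes in "A Plan for Spam". It is a Python illustration only, not the Tcl code running on this site. The class and method names (SpamFilter, train, score) and the tokenizer are made up for the example, and lowercasing tokens is an assumption. The constants (doubling good counts, ignoring tokens seen fewer than 5 times, 0.4 for unknown tokens, clamping to 0.01 and 0.99, using the 15 most interesting tokens) follow that essay.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of letters, digits, $, ' and -.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    # Lowercasing is an assumption made for this sketch.
    return [t.lower() for t in TOKEN_RE.findall(text)]


class SpamFilter:
    """Graham-style Bayesian filter; names here are illustrative."""

    def __init__(self):
        self.good = Counter()   # token counts in wanted pings
        self.bad = Counter()    # token counts in spam pings
        self.ngood = 0          # number of wanted pings seen
        self.nbad = 0           # number of spam pings seen

    def train(self, text, is_spam):
        if is_spam:
            self.bad.update(tokenize(text))
            self.nbad += 1
        else:
            self.good.update(tokenize(text))
            self.ngood += 1

    def token_prob(self, token):
        g = 2 * self.good[token]   # good counts are doubled to bias against false positives
        b = self.bad[token]
        if g + b < 5 or self.ngood == 0 or self.nbad == 0:
            return 0.4             # too little evidence: treat as mildly innocent
        bad_freq = min(1.0, b / self.nbad)
        good_freq = min(1.0, g / self.ngood)
        return max(0.01, min(0.99, bad_freq / (good_freq + bad_freq)))

    def score(self, text):
        # Keep the 15 tokens whose probability is farthest from neutral 0.5.
        probs = sorted(
            (self.token_prob(t) for t in set(tokenize(text))),
            key=lambda p: abs(p - 0.5),
            reverse=True,
        )[:15]
        prod, inv = 1.0, 1.0
        for p in probs:
            prod *= p
            inv *= 1.0 - p
        return prod / (prod + inv)
```

A ping would be trained with something like f.train(title + " " + excerpt, is_spam=True) and later treated as spam when f.score(...) comes out above 0.9, the cutoff used in the essay. With no 'bad' examples at all, every token falls back to 0.4, which is why the corpus has to be gathered first.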

I'll need to redo the comment system a bit to allow pseudo-anonymous commenting, but that can wait until I have enough spam to train the classifier well. I think the ratio of good to bad there is more muddled, so classifying comments will end up being more useful than trackback auto-classification.

Comments

It's almost too easy. All the trackback spam so far has the exact same text in the post title, post excerpt, and source name. This isn't a fun way to do auto classification learning. Where do I go to get some real spam?

Ryan Lee on March 07, 2006 03:34 PM

TrackBacks

Trackback Autoclassification

I flipped the switch for autoclassification of trackbacks today. With around a 2:1 ratio of desired comments to spam, I hope it's enough data to go on. Either trackback is more dead than I thought or I just don't get many; I could probably go back to...

Ryan's Journal on March 20, 2006 07:31 PM [trackback]