|Now Gathering Web Spam - monday 2006-03-06 0826||last modified 2006-03-06 0826|
|Categories: Nerdy, ryanlee.org|
|TrackBacks Sent: None|
You should ignore the 'trackback' section on each journal entry for a couple of weeks; I'm trying to build up a corpus of web spam for my classifier (really more like Paul Graham's classifier, now in Tcl). I made the mistake of trashing all the bad trackbacks as they came in, so I have nothing 'bad' of my own with which to train the system. Over the past two and a half years I kept 12 good trackbacks out of 19,275 attempts (about 0.06%), and 9 of those 12 are from me, though for the vast majority of that time I had the trackback system set to simply reject any pings. I'm assuming the noise-to-signal ratio is going to be incredibly high again.
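For anyone wondering why the missing 'bad' corpus matters, here is a rough sketch of the kind of scoring Paul Graham describes in "A Plan for Spam". It is in Python rather than Tcl and is not the code running on this site; the function names, the tokenizer regex, and the sample strings are all made up for illustration. The constants (good counts doubled, a minimum of 5 occurrences, probabilities clamped to 0.01 and 0.99, 0.4 for unseen tokens, the 15 most telling tokens) follow Graham's essay as I remember it.

```python
import re
from collections import Counter

# Assumed tokenizer: alphanumerics, dashes, apostrophes and dollar signs.
TOKEN_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text):
    """Split text into lowercase tokens (case is ignored in this sketch)."""
    return [t.lower() for t in TOKEN_RE.findall(text)]


def train(good_docs, bad_docs):
    """Build a token -> spam probability table from the two corpora."""
    good = Counter(t for doc in good_docs for t in tokenize(doc))
    bad = Counter(t for doc in bad_docs for t in tokenize(doc))
    ngood = max(len(good_docs), 1)  # avoid dividing by zero on an empty corpus
    nbad = max(len(bad_docs), 1)
    probs = {}
    for tok in set(good) | set(bad):
        g = 2 * good[tok]  # good counts are doubled to bias against false positives
        b = bad[tok]
        if g + b < 5:
            continue  # too rare to say anything about
        good_freq = min(1.0, g / ngood)
        bad_freq = min(1.0, b / nbad)
        p = bad_freq / (good_freq + bad_freq)
        probs[tok] = max(0.01, min(0.99, p))
    return probs


def spam_probability(text, probs, unknown=0.4, top_n=15):
    """Combine the most telling token probabilities into one score."""
    ps = [probs.get(t, unknown) for t in set(tokenize(text))]
    # "Most telling" means furthest from the neutral value 0.5.
    ps.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod, inv = 1.0, 1.0
    for p in ps[:top_n]:
        prod *= p
        inv *= 1.0 - p
    return prod / (prod + inv)


# Hypothetical training data, just to exercise the functions.
good_pings = ["thanks for linking to my post"] * 5
bad_pings = ["cheap pills online casino"] * 5
table = train(good_pings, bad_pings)
is_spam = spam_probability("cheap casino pills", table) > 0.9
```

The table is built from both corpora, which is the catch: with only 12 good trackbacks and no saved bad ones, there is nothing to compute the spam side of each token's probability from.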
I'll need to redo the comment system a bit to allow pseudo-anonymous commenting, but that can wait until I have enough spam to train my classifier to work well. I think the ratio of spam to legitimate comments there is more muddled, so classifying comments will end up being more useful than trackback auto-classification.