Last spring, I started wondering about the web email address harvesters that spammers use. I knew they were hitting my site, because I get spam at addresses that are only shown there. But I had no idea which of the entries in my Apache logs corresponded to the spammers’ harvesters. I didn’t know how many different harvesters were coming around. I had no clue how much time passed between an address being harvested off the web and the first piece of spam arriving. And I didn’t have any way to "take back" addresses after they were harvested.
So, on a boring weekend day, I decided to make a simple system to help me gather information. I wrote a tiny bit of code to generate a unique email address for every page load on my main web site. Every time one of those pages is fetched, the email address at the bottom will be different. It’s basically an encrypted identifier that I can later correlate with log entries. Incoming mail to anything in the subdomain used for those addresses goes through a bit of software that decrypts the ID (the left-hand side of the address) and makes sure it’s a valid generated address. This validation step has the added benefit that I can add any harvested address to a blacklist as soon as I receive spam on it, preventing any future spam to that address. And since I can correlate it with the logs to find out who harvested it, I can also invalidate any other addresses sent to the same client. And this all happens without inconveniencing people who want to send me mail from my web pages; there’s still a perfectly valid, clickable email address on every page.
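As a rough sketch of the idea (not the actual code behind the site), here is one way to generate and validate per-request addresses in Python. It swaps the encryption described above for an HMAC-signed identifier, which gives the same "only I can mint valid addresses" property. The key, the subdomain `h.example.com`, and the function names are all made up for illustration, and the request ID is assumed to be a number that is also written to the access log.

```python
import base64
import hashlib
import hmac
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key
DOMAIN = "h.example.com"                    # hypothetical subdomain


def _b32(data: bytes) -> str:
    # Lowercase base32 without padding is safe in an email local part.
    return base64.b32encode(data).decode("ascii").rstrip("=").lower()


def _unb32(text: str) -> bytes:
    # Restore the padding that _b32 stripped before decoding.
    padded = text.upper() + "=" * (-len(text) % 8)
    return base64.b32decode(padded)


def make_address(request_id: int) -> str:
    """Build a unique address for one page load."""
    payload = request_id.to_bytes(8, "big")
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:8]
    return f"{_b32(payload + tag)}@{DOMAIN}"


def check_address(address: str, blacklist: set) -> Optional[int]:
    """Return the request ID if the address is valid and not blacklisted."""
    local, _, domain = address.partition("@")
    if domain.lower() != DOMAIN:
        return None
    try:
        raw = _unb32(local)
    except ValueError:  # also covers binascii.Error
        return None
    if len(raw) != 16:
        return None
    payload, tag = raw[:8], raw[8:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(tag, expected):
        return None
    request_id = int.from_bytes(payload, "big")
    return None if request_id in blacklist else request_id
```

When spam arrives on an address, the ID returned by `check_address` can be added to the blacklist, and the same ID can be looked up in the logs to find the client that fetched the page.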
This little experiment hasn’t produced any groundbreaking information, but I have found a few interesting tidbits. I was surprised by how little of the spam I’ve received turned out to be from recent address harvesting. In the 8 months or so that I’ve been doing this, there have only been about fifteen spam messages sent to these addresses. It’s possible that it takes longer than 8 months for the addresses to get into wide circulation, so I’ll have to keep watching to see if the spam ratio ramps up.
Here are the access log entries that directly resulted in spam:
A couple of the clients were nice enough to actually send along real referer information. That kind of surprised me. One sent an obviously faked referer of "http://microsoft.com". A couple of them (both from IP addresses in China) sent the bogus user-agent string "Internet Explore 5.x". The rest either sent no user-agent header at all or had one that looked fairly "normal."
The shortest time between an address being harvested and receiving its first spam is seven hours. The longest is 117 days. It seems that the first spam almost always arrives either within a day or only after several weeks. I’ve considered adding a timestamp and having the addresses only be good for four hours or so, to allow legitimate messages to get through without letting spam in, but I haven’t gotten around to implementing that yet.
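The expiring-address idea could look something like the following Python sketch. This is only an illustration of the approach, not something that is running anywhere: the key, the token layout, and the names `make_token` and `token_is_fresh` are hypothetical, and the four-hour window is the figure mentioned above.

```python
import hashlib
import hmac
import struct
import time
from typing import Optional

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key
MAX_AGE_SECONDS = 4 * 60 * 60               # "four hours or so"


def make_token(request_id: int, issued_at: Optional[int] = None) -> str:
    """Pack the request ID and issue time, then sign them."""
    if issued_at is None:
        issued_at = int(time.time())
    payload = struct.pack(">QI", request_id, issued_at)  # 8 + 4 bytes
    tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:8]
    return (payload + tag).hex()  # 40 hex chars, usable as a local part


def token_is_fresh(token: str, now: Optional[int] = None) -> bool:
    """Accept a token only if it is authentic and recent enough."""
    try:
        raw = bytes.fromhex(token)
    except ValueError:
        return False
    if len(raw) != 20:
        return False
    payload, tag = raw[:12], raw[12:]
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:8]
    if not hmac.compare_digest(tag, expected):
        return False
    _, issued_at = struct.unpack(">QI", payload)
    if now is None:
        now = int(time.time())
    return 0 <= now - issued_at <= MAX_AGE_SECONDS
```

With a four-hour limit, even the fastest harvester seen so far (seven hours) would be mailing an address that has already expired, while someone who clicks the link right after loading the page would still get through.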
I was hoping to find some more decisive patterns in spam harvesters that would allow me to block a significant number of them. I’m not surprised that I didn’t find any, but I am a little disappointed. I’ll probably start blocking anything that identifies itself as "Internet Explore" or sends a referer of "microsoft.com". If my page gets linked from the main page of Microsoft’s web site, I’ll be in trouble, but I’m not losing any sleep over that.
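Those two blocking rules are simple enough to sketch as a Python check on the request headers. The function name and the exact matching choices are assumptions made for illustration: it flags "Internet Explore" only when it is not followed by an "r", and it treats a referer as suspicious only when it points at the bare front page of microsoft.com (with or without "www.").

```python
import re
from typing import Optional
from urllib.parse import urlsplit

# "Internet Explore" that is NOT the start of "Internet Explorer".
BOGUS_AGENT = re.compile(r"Internet Explore(?!r)", re.IGNORECASE)
SUSPECT_HOSTS = {"microsoft.com", "www.microsoft.com"}


def looks_like_harvester(user_agent: Optional[str],
                         referer: Optional[str]) -> bool:
    """Apply the two heuristics described above to one request."""
    if user_agent and BOGUS_AGENT.search(user_agent):
        return True
    if referer:
        # Tolerate a referer sent without a scheme, e.g. "microsoft.com".
        candidate = referer if "//" in referer else "//" + referer
        try:
            parts = urlsplit(candidate)
            host = parts.hostname
        except ValueError:
            return False
        if host in SUSPECT_HOSTS and parts.path in ("", "/"):
            return True
    return False
```

A request with no user-agent header at all is not flagged here, since plenty of the harvesters sent perfectly ordinary-looking headers and missing ones were not part of the plan.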