Behind The Scenes - Dealing With Comment Spam
For a while, running this site was becoming a headache as the amount of comment spam I had to deal with increased day on day. At one point hundreds were coming in every day. Unlike email spam, there's no truly reliable and effective solution to this problem, although now I think I've arrived at a suitable compromise.
The most commonly used solution out there seems to be getting the user to prove they're a real person by typing the text they see in an obscured image into the box below it. Personally, I really don't like this approach. It annoys me no end whenever I have to use one. Not the principle of it as much as the way the image is so obscured it's sometimes impossible for me — a real person — to make sense of it.
If you have a blog and want to encourage commenting from as many readers as possible, then you need to make it as painless as possible.
It's my choice to run this blog, so the burdens that come with it rest on my shoulders. Not yours. As such, I've had to put in place ways of dealing with the tide of links posted to blogs without resorting to any kind of person-or-computer front-end logic. This started as a fully manual process of deletion but has since become a semi-automated process that's about 99% effective.
The "comment" form has two fields on it (well, obviously more than that, but...). One is "SaveOptions" and the other is called "Approved".
SaveOptions is computed and refers to an App Settings document which lists the "names" that have been banned. The majority of spam here used to come from people called "Cialis" and "Viagra". Maintaining this list, and never even saving documents posted under those names, has cut out 75% of the problem.
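In formula terms it amounts to something like this. This is only a sketch: the profile document and field names are illustrative rather than the exact ones in use here.

    REM {Refuse to save the document (SaveOptions = "0") if the name is banned};
    BannedNames := @GetProfileField("App Settings"; "BannedNames");
    @If(@IsMember(@LowerCase(From); @LowerCase(BannedNames)); "0"; "1")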
The "Approved" field is also computed and equates to a boolean 1 or 0. One of the things this field does is trap comments to any blog entry older than 7 days. The vast majority of feedback on a blog entry comes either on the day it's posted or during the following couple of days. As soon as it's replaced by a more recent blog entry then the comments all but stop. On the flipside almost all spam is posted to old entries (maybe in the vain hope we either won't notice or don't care).
If the comment is approved it appears directly on the web page and I get an email telling me. If it's not approved for whatever reason then it won't appear directly on the web. Instead there's a message explaining that it's awaiting approval. The same email comes to me but it also highlights the fact it's unapproved and needs my attention.
This traps another 20% of the problem and at least means there's no chance of them getting any Google juice while I get round to deleting them. The fact that a comment doesn't appear on the web immediately should also act as a deterrent against bothering. Even if it did appear, all links in blog comments are rendered with the attribute rel="nofollow", which denies them the search-ranking value they so crave anyway.
While this might trap the spam posts, I still don't want them in store.nsf at all, and so I make a point of clearing them all out. To do this I use a view in the Notes client called "everything". It's the view all databases should have, with columns including @Created, Remote_Addr, Name, Form etc. Here's what it looks like:
{Image}
The reason I prefer to use the client-side view is mainly that deleting multiple documents in one go is much quicker than in the browser. There are patterns to spam. Most of it comes in spurts of 30 or so at once, all within a matter of minutes, and almost always overnight. This makes it a lot easier to deal with in one fell swoop, especially using a client view. It's become one of my chores each morning when I wake up.
And a chore it can be. Having had plenty of time to get used to it and to notice the patterns in how it happens, though, I now see it as just a sad fact of life on the web, and one made a lot easier to deal with by simple measures to block and remove it. Whether it will ever be a fully automated system is another matter. One thing's for sure - I won't ever expect you to do it for me.
The thing which has worked most effectively for me is to check HTTP_Referer: if it is blank when saving then I set SaveOptions to "0". It seems that most comment spammers run bots and so aren't navigating through the website, and thus don't have a history trail.
It works almost 100% of the time.
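A minimal formula sketch of that check, assuming the CGI variable is captured on the form in a field named HTTP_Referer:

    REM {Silently refuse to save if the request arrived with no referrer};
    @If(@Trim(HTTP_Referer) = ""; "0"; "1")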
Hadn't thought of that. Does it work for users whose browsers prevent HTTP_Referer from being sent? Assuming that can be done. I'm sure it can be, and that ultra-paranoid surfers do it.
I've had 100% effectiveness with the "captcha" I added, which simply asks the commenter to solve a rudimentary math equation. I went from about 30-40 spams a day to zero, and it's stayed that way ever since. So far nobody has complained, and I am still getting real comments from people.
It does rely on Javascript, which puts a small percentage of readers in the "can't ever post" category, but the number of people not running JS in their browsers has become statistically insignificant, IMHO.
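For anyone wanting to try the same idea, here's a minimal sketch of the client half. The element names are invented, and the answer would of course also need verifying on the server:

    // Generate a simple sum and refuse to submit until it's answered.
    var a = Math.floor(Math.random() * 10) + 1;
    var b = Math.floor(Math.random() * 10) + 1;
    document.getElementById("mathQuestion").innerHTML = "What is " + a + " + " + b + "?";
    document.getElementById("commentForm").onsubmit = function () {
        var answer = parseInt(document.getElementById("mathAnswer").value, 10);
        if (answer !== a + b) {
            alert("Please answer the math question before posting.");
            return false; // cancel the submission
        }
        return true;
    };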
I like your "approval" concept, though, Jake. My goal was to eliminate any sort of maintenance, so I didn't go the route of needing manual attention to things. Yours makes for a more forgiving approach though.
Nice explanation! Your approach might be even more effective with "negative captcha", mentioned here:
{Link}
I personally have no problem with the 'obscured image' method. If it's good enough for PayPal, Google et al...
No problem as what, Paul? A blog owner or a user? If it's as a user then you have a lot more patience than I do.
Jake, I'm on your side. I hate putting the load onto the user. Captchas fail when it comes to users with disabilities. But like Mike says, spammers are running bots, not "using" your site. I find a nice little trick is putting an <input type=hidden> field in and then using a small piece of JavaScript further down the page to set its value to the DocID of the parent. If the submitted field doesn't match the parent ID then they haven't used your web page, and you can effectively block the message.
Now people will complain about how it discriminates against users who aren't running Javascript. But seriously folks, who doesn't have JavaScript enabled or is using ancient (seriously ancient) browsers?
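A rough sketch of that trick (the field name and ID value are invented for the example):

    <input type="hidden" name="parentCheck" id="parentCheck" value="">

    <script type="text/javascript">
    // Further down the page: fill in the hidden field with the parent
    // document's ID. A bot posting the raw form never runs this, so the
    // field arrives empty (or wrong) and the server can reject it.
    document.getElementById("parentCheck").value = "ABC123"; // parent DocID as written out by the server
    </script>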
Jake, I think your methods are pretty sound. I personally browse with the NoScript extension and do not pass the referrer, though I'm typically much more paranoid than the average user :P My blog has been running off WordPress, and Akismet has really been doing an awesome job stopping spam; it lets one through every once in a while.
The fun way to replace obscured images would be to use easy riddles (or harder ones if so desired). For example:
Q: "What is regarded man's best friend? Sometimes called K9... Starts with d and ends with g..."
A: dog
Hehheh :-)
I like Damien's "Negative Captcha" approach:
{Link}
It uses a hidden (via CSS) email field in addition to the displayed email field. If that field has a value, the comment was submitted by a bot and can be deleted. When humans enter comments, that field will always be empty.
Ned Batchelder also wrote an article on the subject after Damien's post:
{Link}
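For illustration, the hidden decoy field Damien describes might be marked up something like this (the class and field names are made up):

    <style type="text/css">
        /* Hide the decoy from humans; bots filling in every field still "see" it */
        .decoy { display: none; }
    </style>
    <p class="decoy"><label>Email: <input type="text" name="email_check" value=""></label></p>
    <p><label>Email: <input type="text" name="email" value=""></label></p>

Server-side, any submission where email_check has a value gets thrown away.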
The negative captcha trick is a great solution to protect against robot spam.
BlogSphere and the IBM Blog template (previously DominoBlog) both use a different method, where the act of clicking the submit button runs some JavaScript to add an extra parameter to the HTTP POST request, which is then checked by the server. Robots just do a simple submit and can't run the JavaScript, so both those templates are protected from robot spam also.
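Roughly speaking, the idea is something like this (a sketch only, not the templates' actual code):

    <input type="hidden" name="jsToken" value="">
    <input type="button" value="Submit Comment" onclick="postComment(this.form)">

    <script type="text/javascript">
    // Fill in a token only a script-running browser will ever send,
    // then post the form. The server rejects requests without it.
    function postComment(form) {
        form.jsToken.value = "a1b2c3"; // value agreed with the server
        form.submit();
    }
    </script>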
But there is one other type of blog spam that there's no way of blocking, and that's the stuff that's entered manually by people. It doesn't happen often, but it does happen. The only way of dealing with it is to do it manually, either by locking old entries down like Jake has suggested or by moderating all comments.
Definitely need to use multiple methods that create a weighting system to assist with filtering. The referrer method, coupled with the JavaScript payload that Declan mentioned, and a keyword filter are probably the quickest/least painful combination to stop the vast majority of spam. But that negative captcha sounds very intriguing.
The more I think about it, the more the idea of using the SaveOptions field at all seems just plain bad. Unless, that is, you're 100% sure it's spam. Using HTTP_Referer with SaveOptions is a no for me. It's just too likely you'll get false positives, and users will lose all the text they just typed in.
Much better to use an "Approved" field and filter it manually in the back-end.
I currently use an approach I saw on Volker's site with a minor twist.
I have an image called 1492.gif (for example) next to a field we'll call "keycheckval", which asks for the numbers/letters visible in the image.
I have a hidden field called "keycheck" which stores the name of the image that I've loaded. (For this example, it'll store "1492.gif").
Now the user enters what they see for that image - and here's the trick - the image doesn't display "1492" but rather "1234".
Now, I post keycheckval ("1234") and keycheck ("1492.gif") to my comments agent, which looks to a comparison table to see if the visual value of the paired image matches what the user entered:
1492.gif = "1234"
5512.gif = "4321"
blah.gif = "9999"
... etc., etc., etc.!
So, if a user/bot enters "1492" instead of "1234" (thinking I paired the name of the image to its correlating key value), the agent immediately returns false and won't process the comment. Otherwise, you either 1) are a bona fide human-like person or 2) deserve to have your spam on my blog.
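In formula terms (written here as a field-validation sketch rather than the actual agent code; "KeyPairs" is a made-up lookup view) the check boils down to:

    REM {Look up the expected value for the image this form was served with};
    expected := @DbLookup(""; ""; "KeyPairs"; keycheck; 2);
    @If(keycheckval = expected; @Success; @Failure("Comment not processed"))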
Hopefully that makes sense to someone other than me...
All I know is that, after implementing this countermeasure, I've yet to have a single spam entry in my comments.
-C
One thing about captchas is that bots, as Chris points out, can be adapted to them.
Here is where Jake's method works well. He's using heuristics to limit the amount of spam he gets, then uses human monitoring to filter out the rest. It's better than Bayesian filtering, but still requires manual work for the admin.
I look at the 'work on the user' issue as a minimal thing. We already enter our name, email and address and type long-winded comments... a few more digits is hardly an inconvenience, though I can't disagree with the principle of not putting a barrier between the user and posting a comment.
Back to the adaptable bot scenario though, it seems anything you do - hidden fields, negative captchas, regular captchas, has to be made sufficiently random, or easily modifiable, to make it as trivial for the admin to adjust as it is for spammy to tweak or adapt his bot. Eventually, one party will give up... and bloggers are a stubborn lot!
Hi Jake,
I also use the JavaScript method. It does not burden the real user and blocks all comment spam from bots:
{Link}
The only thing left is manually entered comment spam, but that is easier to delete manually than to maintain a huge ban list for, in my opinion.
I use a modified version of DominoBlog for my blog template. As has already been stated, it "runs some javascript to add an extra parameter to the http POST request". It also uses a link throttle: set the maximum number of links that you will accept, and almost all comment spam goes over the threshold, so the comment is refused. The configuration form also has the ability to require fields, which I've set to require all except the web site field.
I've added a "honey pot" field {Link} that is checked in the code that accepts the comment. If it's not right, the comment is blocked.
These measures stop all but humans. If they don't break the rules, the comment is accepted. However, in the last three and a half years I've only received three or four human-created comment spam messages.
I subscribe to my comment feed so that I can check for comment spam. I do have the option to moderate comments but so far haven't had to. After three years I enabled the agent that automatically deletes all blocked comments, as I have yet to find a valid comment in the blocked comments view. I'm so confident there won't be one that I actually have it run every night.
I receive around 100 to 200 comment spam messages a day; over the last three and a half years I've received 247 valid comments.
I forgot to mention using link counters to determine spam. I filter any post with more than three links/URLs.
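Something along these lines, as a JavaScript sketch:

    // Link throttle: count URLs in the comment body and reject
    // anything over the limit (three, per the comment above).
    function tooManyLinks(text) {
        var matches = text.match(/https?:\/\//gi);
        return matches !== null && matches.length > 3;
    }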
I support Dragon & Ferdy: the JavaScript method is one of the best.
But checking HTTP_Referer is not always good. Some legitimate users have it disabled, and some proxy servers don't pass it through in the headers.
One of the things we have (if Domino runs on a Windows server and can use wsock32.dll) is a reverse IP check against blacklists (bl.spamcop.net, dnsbl.sorbs.net etc.).
If the IP is found, the comment is still saved but gets an attribute set on it (IsSpam="Yes"). It's saved and shown with an icon in the view later, because some users post good comments while their computer harbours some worm or virus.
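For the curious, a DNSBL lookup is simple enough to sketch (Node.js here purely for illustration; the commenter's version calls wsock32.dll from Domino). The octets of the address are reversed and resolved against the blacklist zone; if the name resolves, the address is listed:

    // Check an IP against a DNS blacklist. 127.0.0.2 is the standard
    // test address that every DNSBL lists.
    var dns = require("dns");
    function checkBlacklist(ip, zone, callback) {
        var reversed = ip.split(".").reverse().join(".");
        dns.resolve4(reversed + "." + zone, function (err) {
            callback(!err); // listed if the lookup resolves
        });
    }
    checkBlacklist("127.0.0.2", "bl.spamcop.net", function (listed) {
        console.log(listed ? "IsSpam = Yes" : "clean");
    });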