Tuesday, May 19, 2009

Techniques to Fight Comment Spam

The following post is a list of techniques that I have run across that attempt to deal with the problem of comment spam on sites.  This is a follow-up to my last post, titled "Preventing Comment Spam."

Requiring authentication is generally seen as a fairly effective approach to preventing comment spam.  However, the disadvantages are frequently enough to dissuade implementation on many sites.  One problem is that building authentication takes resources and expertise that not every development shop has.  Another issue is that authentication is a barrier to leaving comments that casual visitors will probably not bother overcoming.  In addition, the challenge is not that great for technically adept spammers if the payoff is access to a large user base.

Building upon authentication is the idea of karma.  Forcing users to build karma based on quality of participation before they can take certain actions is usually a hurdle that is too high for most spammers to deal with.  Unfortunately, depending on your user base, it can be an equally high hurdle to legitimate participation.
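The karma idea can be sketched in a few lines.  This is an illustrative Python fragment, not code from the site; the threshold values and the `User` class are assumptions made up for the example:

```python
# Hypothetical karma thresholds: new users can comment right away,
# but posting links requires established karma.
KARMA_TO_COMMENT = 0
KARMA_TO_POST_LINKS = 10

class User:
    def __init__(self, name, karma=0):
        self.name = name
        self.karma = karma

def can_post(user, contains_links):
    """Allow the action only once the user has earned enough karma."""
    required = KARMA_TO_POST_LINKS if contains_links else KARMA_TO_COMMENT
    return user.karma >= required
```

Since most spam comments exist to carry links, gating link posting alone raises the bar considerably while leaving plain comments open to newcomers.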

Moderation is another technique that comes up often.  It is generally regarded as the only foolproof approach.  Simply put, every post made to the site is screened by a human being.  The downside of course is that if your site is heavily trafficked by spammers, weeding quickly becomes a task that takes up all of your time.
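The mechanics of moderation are simple enough to show in a short sketch.  This is a minimal Python illustration of a hold-everything queue, not any particular blog engine's implementation:

```python
from collections import deque

class ModerationQueue:
    """Every submitted comment is held until a human approves or rejects it."""
    def __init__(self):
        self.pending = deque()   # comments awaiting review, oldest first
        self.approved = []       # comments cleared for display

    def submit(self, comment):
        self.pending.append(comment)

    def review(self, approve):
        """A moderator handles the oldest pending comment."""
        comment = self.pending.popleft()
        if approve:
            self.approved.append(comment)
        return comment
```

Nothing reaches `approved` without a human decision, which is exactly why the approach is foolproof and exactly why it does not scale under heavy spam traffic.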

Filtering is another approach that can best be described as automated moderation.  As an example, I found ReverseDOS.  This is an easy-to-set-up ASP.NET HttpModule that reads all of the content of a request and determines whether or not the request is a spam attempt based on rules that you define.  The rules can include checking all or only a portion of the request against a set of regular expressions and can be turned on or off for each directory within a site.
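To make the idea concrete, here is a small Python sketch of that style of filter.  It is not ReverseDOS itself (which is an ASP.NET HttpModule), and the rule set is invented for the example, but it shows the same shape: per-directory regular expressions applied to the request body:

```python
import re

# Hypothetical rule set: each directory maps to patterns that mark a request as spam.
RULES = {
    "/blog/": [
        re.compile(r"cheap\s+pills", re.I),
        re.compile(r"https?://\S*casino", re.I),
    ],
    "/forum/": [
        re.compile(r"v[i1]agra", re.I),
    ],
}

def is_spam(path, body):
    """Check the request body against every rule configured for its directory."""
    for directory, patterns in RULES.items():
        if path.startswith(directory):
            if any(p.search(body) for p in patterns):
                return True
    return False
```

Because the rules are keyed by directory, a pattern that is too aggressive for the forum can still be enabled on the blog, mirroring ReverseDOS's per-directory on/off switches.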

Another suggestion along these lines was to create a central repository for tracking spam.  Sites could query the repository which would try to determine if the submitted content was spam based on past submissions, user feedback and a bit of good natured artificial intelligence.  Regardless of the technique, the idea of filtering is to cut down the number of spam comments to an amount manageable by other means.
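A toy stand-in for such a repository might look like the following.  Real services of this kind (Akismet is the well-known example) use far more signals; this sketch only models the "past submissions plus user feedback" part, with a made-up report threshold:

```python
class SpamRepository:
    """Toy shared spam tracker: content reported by enough participating
    sites is flagged as spam for everyone who queries it."""
    def __init__(self, threshold=3):
        self.reports = {}        # normalized content -> report count
        self.threshold = threshold

    def report(self, content):
        """A site (or its users) flags a comment as spam."""
        key = content.strip().lower()
        self.reports[key] = self.reports.get(key, 0) + 1

    def is_spam(self, content):
        """Query before accepting a comment."""
        return self.reports.get(content.strip().lower(), 0) >= self.threshold
```

The appeal of the shared model is that a spam run only has to be identified once; every other subscriber benefits immediately.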

Reverse Turing tests like CAPTCHA can sometimes be used to increase the difficulty of posting spam.  The problem is that the effectiveness of the most common implementation, retyping words presented as an image, wanes as image recognition tools get better and better.  The images must be warped more and more to defeat automated scanning, but that makes them harder for legitimate users as well.  For example, Google's CAPTCHA for new email accounts is at times so difficult to read that I only get one out of four correct.

Throttling can be used to prevent any user from posting too many times.  Limits can be set on the number of items created over a span of time, or checks can ensure that no user posts multiple comments back to back in a single thread.  The challenge here lies in identifying users.  If no authentication is used, relying on IP address is inconsistent at best and runs the risk of blocking legitimate users.
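A sliding-window rate limit captures the first kind of throttle.  This is a minimal Python sketch assuming an in-memory history and made-up limits; the "user" key could be an account name or, less reliably, an IP address:

```python
import time

class Throttle:
    """Reject a post once a user exceeds max_posts within window seconds."""
    def __init__(self, max_posts=3, window=60.0):
        self.max_posts = max_posts
        self.window = window
        self.history = {}  # user key -> timestamps of recent posts

    def allow(self, user, now=None):
        now = time.time() if now is None else now
        # Keep only posts still inside the sliding window.
        recent = [t for t in self.history.get(user, []) if now - t < self.window]
        if len(recent) >= self.max_posts:
            self.history[user] = recent
            return False
        recent.append(now)
        self.history[user] = recent
        return True
```

The `now` parameter exists so the logic can be tested deterministically; in production the default `time.time()` would be used.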

In the end, no single approach is probably good enough to stop spam. The pet project I am currently working on has been built with a mix of most of the techniques above.  I combined a bunch of existing frameworks with a little bit of custom code so it wasn't too much work.  At times I worry that I may have spent too much time on this aspect of the site.  Then again, the whole site was started as a learning endeavor.  If nothing else, I gained some knowledge and will have the tools in place to respond quickly if spammers begin to target the site.
