magnuseriksson.se

Saturday, July 31, 2010

How to protect your web site from comment spam

Author: Magnus Eriksson
Published: 2006-10-01

Preface

You might think that comment spam only affects large websites with thousands of visitors each day. That may have been true in the past, but with increasingly sophisticated spam tools available, small sites take a hit too, because the tools do not discriminate about where their spam URLs end up.

This article will give an overview of the options available to combat comment spam and their respective pros and cons.

The ambitions of spam protection

The most obvious goal is, naturally, to identify and block spam comments while letting the good ones through. There are also some secondary goals to consider, since they affect the user experience.

Accessibility

Not everybody uses a graphical web browser such as Internet Explorer or Firefox. Some users, for various reasons, use purely text-based browsers like Lynx, and visually impaired users may use speaking browsers that read the written text aloud. It is important not to forget these users when designing your comment spam protection. This consideration can rule out all protection based on human image recognition, though, which is one important technique used for spam protection.

Privacy and security

Some users have disabled cookies in their browser for privacy reasons, and others may have disabled JavaScript for extra security. Both technologies are important for automatic spam detection. To provide the best possible solution, you should not assume that they are available in the user's browser; otherwise it might be impossible for those users to submit valid comments.

Discussion latency

If you expect not only comments on the original web page but also that users will comment on each other's comments, and you want those discussions to flourish, then you are better off with automated protection than with moderation. If a post does not show up until several hours later, that will surely kill a spontaneous discussion.

The threats

There are different types of comment spam, and the type used depends on how much effort the spammer wants to put into getting the spam onto a specific page. If the spammer is only interested in getting a single message onto a specific page once, then the manual approach is probably used. Normally, however, spamming only pays off when the spam message can be submitted automatically and periodically to multiple sites.

Manually entered spam

These are spam messages entered manually by humans. This is the most difficult spam to combat, especially since some protection mechanisms depend on the submitter not being human.

General spam tools

Today's spam tools are getting smarter. They are designed to behave as much as possible like a user with a standard web browser. A number of target pages first need to be identified, either manually or by some other tool beforehand. The page where comments are posted is then requested once, to ensure that the server starts a session and to obtain the value of any hidden form fields that the server generates per session. It is likely that newer tools also support JavaScript parsing and can handle e.g. simple renaming of field names.

Web site customized spam tools

If a determined spammer has decided to attack a specific website or web publishing system, it is much more difficult to respond. All kinds of obfuscation are useless, since the spammer has access to all the client-side protection the webmaster might have added.

The techniques

There are a few different techniques available. They all have their pros and cons, and the problems they cause depend on the size of the website they are used on and the accessibility limitations the webmaster can live with. The different techniques are covered in more detail below.

Moderation

This is the process of reviewing and approving each new post before it is published on the website.

It is definitely the most accurate method, since you as the webmaster are in total control of which posts are rejected and which are approved for publishing.

There are two fundamental problems with this approach, though. The first is that the webmaster can get overwhelmed with new posts to review if the site is popular. The second is that real-time discussions between users are not possible, since too much time may pass between the comment being posted and the webmaster finally approving it.

A middle way between full moderation and no moderation is to let trusted users log in to an account and post unmoderated, while anonymous posts remain fully moderated.

Moderation is usually implemented so that the moderator receives an email containing the posted message and two links to choose from, one to reject the post and one to accept it. Web server scripts handle the underlying logic.
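The accept and reject links in such an email should be hard to forge, otherwise anyone who guesses the URL pattern could approve a post. The following is a minimal sketch of one way to do that, assuming each link is signed with an HMAC. The secret key, the base URL, the query parameter names and all function names are hypothetical and not taken from any particular publishing system.

```python
import hashlib
import hmac

# Hypothetical secret known only to the server; replace with a real random value.
SECRET_KEY = b"change-me"

# Hypothetical URL of the server script that handles moderation decisions.
BASE_URL = "https://example.com/moderate"

ACTIONS = ("accept", "reject")


def sign(comment_id: int, action: str) -> str:
    """Return an HMAC-SHA256 signature that binds a comment ID to one action."""
    message = f"{comment_id}:{action}".encode("utf-8")
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def moderation_links(comment_id: int) -> dict:
    """Build the accept and reject links to include in the moderator's email."""
    return {
        action: f"{BASE_URL}?id={comment_id}&action={action}"
                f"&sig={sign(comment_id, action)}"
        for action in ACTIONS
    }


def is_valid_request(comment_id: int, action: str, sig: str) -> bool:
    """Check a clicked link: the action must be known and the signature must match."""
    if action not in ACTIONS:
        return False
    # Compare as bytes so that unexpected characters in sig cannot raise an error.
    expected = sign(comment_id, action).encode("utf-8")
    return hmac.compare_digest(expected, sig.encode("utf-8"))
```

Because the signature covers both the comment ID and the action, a link that accepts one post cannot be reused to accept a different post or to trigger the other action.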

Obfuscation

The problem with automatically detecting spam from automatic spam generators is that the tool's programmer can thoroughly investigate exactly how the web form must be submitted in order to be accepted by the server.

Obfuscation is about complicating the way the form fields are generated and transmitted, so that an automatic tool cannot work out the scheme on its own.

The most efficient way for the tool programmer to make the tool work with the largest number of websites is to handle the most common obfuscation techniques. That is why the most important technique is to have a unique obfuscation scheme.

The web form consists of visible form elements, such as text fields and radio buttons, and hidden form fields that can be used e.g. for identifying requests and for spam protection. The hidden fields can have a default value generated by the server, and the client can change these values, e.g. with a JavaScript algorithm, in a way the server is aware of. The server can then compare the expected value with the received value.

One problem with this is that JavaScript might not always be enabled in the client, as described in the privacy and security section. Another problem is that smart tools can include a general JavaScript interpreter and can therefore process the page exactly as a real browser would.
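The server side of the hidden-field check described above can be sketched as follows. This is only an illustration under stated assumptions: the transformation rule (reverse the string and append its length) is an arbitrary example, the function names are hypothetical, and the script on the page would have to apply exactly the same rule before the form is submitted.

```python
import secrets


def new_challenge() -> str:
    """Generate the default value the server places in the hidden form field."""
    return secrets.token_hex(8)  # 16 hexadecimal characters


def expected_response(challenge: str) -> str:
    """Server-side copy of the rule the page's script is assumed to apply.

    The rule (reverse the string, then append its length) is an arbitrary
    example; a site would choose its own unique scheme.
    """
    return challenge[::-1] + str(len(challenge))


def is_valid_submission(session_challenge: str, received_value: str) -> bool:
    """Accept the post only if the hidden field holds the transformed value."""
    return received_value == expected_response(session_challenge)
```

The server would store the challenge in the visitor's session when it renders the form and call is_valid_submission when the form comes back. A tool that simply echoes the default hidden value fails the check, while a tool that runs the page's script passes it.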

User assistance

The weakness of pure obfuscation, namely that spam tools can mimic all of it, can be avoided if you let the users of the website carry out one part of the job.

One widely used approach is the so-called CAPTCHA image. CAPTCHA is an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart. The principle is that there are still things humans do much better than computers, such as image or audio recognition. The server generates a picture containing the distorted letters of a word, or just random letters. A human can (often) immediately see which letters they are, but it is (almost) impossible for a computer to identify them. The user enters the letters in a text field that is part of the web form, and the server only accepts the post if the letters are correct.

The problem with this approach is that it is almost impossible for visually impaired people to access the protected resource, which poses a serious accessibility concern.

Another method that also falls under the CAPTCHA umbrella is to generate a simple mathematical problem that the user must solve and submit together with the rest of the form fields. This method makes the website accessible to more people while still providing strong protection against general spam tools. A tool written specifically for your website will easily be able to crack that system and post spam comments, though.
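A mathematical question of this kind can be generated and checked with very little code. The sketch below assumes a single addition of two one-digit numbers; the question wording and the function names are hypothetical.

```python
import random
from typing import Tuple


def make_question(rng: random.Random) -> Tuple[str, int]:
    """Return the question text to show in the form and the expected answer."""
    a = rng.randint(1, 9)
    b = rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b


def check_answer(expected: int, submitted: str) -> bool:
    """Compare the user's text input with the expected answer."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        # Anything that is not a whole number counts as a wrong answer.
        return False
```

The expected answer has to stay on the server, for example in the session, and must not be sent along in the form. The fixed wording also shows why a tool written for one specific site defeats the scheme easily: it only has to parse the two numbers out of the question.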

A third mechanism is to ask the user a random question and let the user select the answer from a set of radio buttons. The question can even be a statement asking the user to select radio button number X. The benefit is that hardly any extra effort is required of the user. A general spam tool could, however, guess among the possible options or even submit one post for each option.
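With N radio buttons, a blind guess succeeds once in N attempts on average. The "one post for each option" trick can be blunted if each challenge is usable only once. That single-use rule is an assumption added here for illustration, not something the mechanism requires; the in-memory store and the function names are hypothetical as well.

```python
import secrets
from typing import Dict, Tuple

# Open challenges: token -> correct option number. Hypothetical in-memory
# store; a real site would keep this in the session or a database.
_open_challenges: Dict[str, int] = {}


def new_radio_challenge(num_options: int = 4) -> Tuple[str, str, int]:
    """Create a challenge and return (token, instruction text, correct option)."""
    correct = secrets.randbelow(num_options) + 1  # option numbers start at 1
    token = secrets.token_hex(8)
    _open_challenges[token] = correct
    return token, f"Please select option number {correct}.", correct


def check_radio_answer(token: str, selected: int) -> bool:
    """Validate an answer; the token is consumed whether or not it was right."""
    correct = _open_challenges.pop(token, None)
    return correct is not None and selected == correct
```

Since a wrong answer uses up the token, a tool has to request a fresh form for every guess, and each guess faces a newly drawn correct option.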

Content filtering

This is an advanced method based on identifying certain words or statistical properties in the text of the post itself. The idea is that a typical spam comment will contain certain identifiable words or, for example, an excessive number of links. The problem is that it is difficult to come up with a good algorithm, and almost impossible to get it right every time: either valid comments are rejected or spam is accepted.

This is the same method that email clients have to use, and you can judge for yourself how effective the spam filtering is in your own email client.
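A very simple content filter along these lines might count suspicious words and links, as in the sketch below. The word list, the link limit and the threshold are made-up values for illustration only, and a filter this naive will misjudge some comments, which is exactly the weakness described above.

```python
import re

# Hypothetical word list and link limit; both would need tuning per site.
SUSPICIOUS_WORDS = {"casino", "viagra", "slot"}
MAX_LINKS = 2

LINK_PATTERN = re.compile(r"https?://", re.IGNORECASE)
WORD_PATTERN = re.compile(r"[a-z']+")


def spam_score(text: str) -> int:
    """One point per suspicious word, plus one per link above the limit."""
    words = WORD_PATTERN.findall(text.lower())
    word_hits = sum(1 for word in words if word in SUSPICIOUS_WORDS)
    link_count = len(LINK_PATTERN.findall(text))
    return word_hits + max(0, link_count - MAX_LINKS)


def looks_like_spam(text: str, threshold: int = 2) -> bool:
    """Flag the comment when its score reaches the (hypothetical) threshold."""
    return spam_score(text) >= threshold
```

Under these values, a comment that mentions a casino once in passing scores 1 and is let through, while a comment containing two listed words and three links scores 3 and is flagged.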

HTTP Header checks

There are some properties of a client request that the server can check directly, e.g. the user agent and referrer fields. Both are transmitted as HTTP headers (User-Agent and Referer).

The user agent field identifies the web client application. Since there is a multitude of valid web clients out there, it is not possible to rule out a request based on what is stated there. If the field is blank, however, it is usually safe to reject the post.

The referrer field identifies the URL of the web page that linked to the post-handling page, i.e. the web page that the web form is located on. You could check that the referrer field actually points to your web form and reject the post otherwise. The problem, however, is that many firewalls have options to block referrer fields in HTTP requests, so that kind of check is no longer feasible.
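The two header checks can be sketched as below. Because a strict referrer check would lock out users whose firewall strips the header, this sketch only rejects a post when a referrer is present and points to another host. That tolerant variant is an assumption made here, not a recommendation from the text above. The host name and function name are hypothetical, and the headers are assumed to arrive in a mapping with canonical names (User-Agent, Referer).

```python
from typing import Mapping
from urllib.parse import urlsplit

# Hypothetical host name of the site that serves the comment form.
EXPECTED_HOST = "www.example.com"


def should_reject(headers: Mapping[str, str]) -> bool:
    """Return True when the request headers alone justify rejecting the post."""
    # A blank or missing user agent is treated as a sign of an automated tool.
    if not headers.get("User-Agent", "").strip():
        return True

    # A missing referrer is tolerated, since firewalls may have removed it.
    referer = headers.get("Referer", "").strip()
    if not referer:
        return False

    try:
        host = urlsplit(referer).hostname
    except ValueError:
        # A malformed URL is treated like a referrer from a foreign host.
        return True
    return host != EXPECTED_HOST
```

A request that passes these checks is not proven to be legitimate, since a spam tool can set both headers to any value it likes.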

Summary

Unlike with email spam, there are many different methods available to combat comment spam. There is no ultimate solution, though; the right choice depends entirely on the accessibility requirements of your site and on the threat picture, i.e. whether it is likely that someone will target your site specifically.

