The CAPTCHA arms race
CAPTCHAs... we have all seen them. CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It is a family of techniques for making sure that a user (typically on a website) is indeed a human being and not a program trying to act like one.
When you leave a comment on this blog, you will be asked to type in two words that are displayed as a distorted graphic. Most bulletin boards and free mail providers ask you to do the same before they allow you to create an account.
CAPTCHA 101
The reason behind them is the same most of the time: preventing spam. Spammers use forums, blog comments and contact forms to post their ads. They use bots (quite similar to the bots that update the search index on Google, Yahoo and all other search websites) to automate that process.
So the idea of CAPTCHAs is to present a task to a website visitor that is difficult to solve for a machine, but easy to solve for a human. The graphical CAPTCHA is the most commonly used one.
There are other CAPTCHA variants, such as audio-based ones or CAPTCHAs based on image recognition. I've even seen a simple math question used as a CAPTCHA.
The arms race
In December 2009 Jonathan Wilkins announced that Google's most prevalent CAPTCHA method, reCAPTCHA, had been broken. It is now possible to identify the words presented by reCAPTCHA with an accuracy of around 20%. For spammers that is good enough. With a 20% success rate, every fifth attempt on average will result in a successfully placed ad. Mr. Wilkins argues that even a success rate of 1% is good enough, since the resources spammers use are often not their own and therefore cost them nothing (think botnets).
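To put rough numbers on this, here is a minimal Python sketch of the arithmetic. The function names and the figure of 10,000 attempts are my own illustrative assumptions, not values from Jonathan Wilkins; the sketch simply treats every attempt as an independent try with a fixed success rate.

```python
def expected_successes(attempts: int, success_rate: float) -> float:
    """Expected number of spam posts that get past the CAPTCHA."""
    return attempts * success_rate


def expected_attempts_per_success(success_rate: float) -> float:
    """Average number of attempts needed for one successful post."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


if __name__ == "__main__":
    attempts = 10_000  # hypothetical number of attempts made by a botnet
    for rate in (0.20, 0.01):
        posts = expected_successes(attempts, rate)
        per_success = expected_attempts_per_success(rate)
        print(f"rate {rate:.0%}: ~{posts:.0f} posts, "
              f"one success per ~{per_success:.0f} attempts")
```

Since the attempts cost the spammer next to nothing, even the 1% case still yields a steady trickle of placed ads.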
This is bad news for everyone. I really hope Google updates its reCAPTCHA algorithm to a variant that is harder for machines to solve. For the record, I also use reCAPTCHA here on this blog.
Update: As of December 31st, 2009, reCAPTCHA seems to have been updated. Google responded quite quickly. So far I can already say that the number of spam posts I get on this blog has dropped drastically, although it has not stopped entirely. Anyway, thanks for the quick response, Google!
This situation is critical because spammers do not need to be particularly good at breaking CAPTCHAs. If one out of five CAPTCHAs can be broken and spammers can still make a living from that, the CAPTCHA itself is useless in the sense that spam will still enter your system.
Jonathan Wilkins has published a paper (PDF) in which he gives guidelines for the creation of strong CAPTCHAs. It is a really interesting read even if you're not directly involved with CAPTCHA development.
Which is the best solution?
Well, I guess your mileage may vary.
For now I will stick to reCAPTCHA [official homepage] although it is broken and I need to remove a few unapproved comments every day. I like the idea behind the project so I'm willing to accept the minor annoyance that it currently imposes.
Breaking text-recognition CAPTCHAs such as reCAPTCHA requires strong OCR solutions and, to my personal surprise, that is still a field that needs a lot of improvement. So even if reCAPTCHA becomes too cumbersome for me, I'll switch to another visual CAPTCHA method.
Audio CAPTCHAs are not recommended by Jonathan Wilkins, who argues that the field of speech recognition is more advanced than that of OCR. Security aside, I don't want my visitors to have to do something unfamiliar, and listening to an audio file in order to fill out a form certainly is.
I like the idea of asking a simple question, such as "What color is an orange?" or "What is 3+5?". I'm not sure about the security, though: the latter can be solved automatically, for example by Google itself. However, I'm halfway convinced that this is an approach with a bright future.
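To illustrate both the idea and its weakness, here is a minimal Python sketch of an arithmetic question CAPTCHA together with the kind of solver a bot could use against it. All names (make_question, check_answer, bot_solve) are hypothetical and not taken from any existing CAPTCHA library.

```python
import operator
import random
import re
from typing import Optional, Tuple

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_question(rng: random.Random) -> Tuple[str, int]:
    """Create a question such as 'What is 3+5?' together with its answer."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    symbol = rng.choice(sorted(OPS))
    return f"What is {a}{symbol}{b}?", OPS[symbol](a, b)


def check_answer(expected: int, user_input: str) -> bool:
    """Validate a visitor's answer, tolerating surrounding whitespace."""
    try:
        return int(user_input.strip()) == expected
    except ValueError:
        return False


def bot_solve(question: str) -> Optional[int]:
    """What a bot could do: extract the expression and evaluate it."""
    match = re.search(r"(\d+)\s*([+\-*])\s*(\d+)", question)
    if match is None:
        return None  # e.g. "What color is an orange?" is not handled
    a, symbol, b = match.groups()
    return OPS[symbol](int(a), int(b))


if __name__ == "__main__":
    question, answer = make_question(random.Random())
    print(question, "| bot answers correctly:", bot_solve(question) == answer)
```

A single regular expression is enough to defeat the arithmetic variant. The color question is not caught by this particular solver, but a small fixed pool of such questions could presumably be answered from a lookup table once an attacker has seen them all.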
Promising examples of what might be next
SQUIGL-PIX
On the reCAPTCHA website you can find a link to the SQUIGL-PIX project, apparently the latest project by the reCAPTCHA guys. It presents you with three images and asks you to outline a certain object. The CAPTCHA is solved only if you choose the correct image and then outline the object correctly.
Give it a try. It is fun, easy (for us) and I sure hope it is hard for machines.
CAPTCHA The Dog
Another interesting approach is Captcha The Dog. You are presented with nine images in total and have to pick the one that shows a dog, while all the others show cats. You have to pick the single dog several times (from different picture sets) and click 'ok' once there are only cats.
The idea is brilliant and the basic reasoning behind it is the same that makes SQUIGL-PIX good: Object recognition instead of text recognition.
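As a rough sanity check on how well such a scheme holds up against a bot that cannot recognize anything and just clicks at random, here is a small Python sketch. It assumes that every round is an independent one-in-nine pick and it ignores the final 'ok' step; the number of rounds is also an assumption, since all I know is that it is "several".

```python
from fractions import Fraction


def blind_guess_probability(images_per_round: int, rounds: int) -> Fraction:
    """Chance that random clicking picks the right image in every round."""
    if images_per_round < 1 or rounds < 0:
        raise ValueError("need at least one image and a non-negative round count")
    return Fraction(1, images_per_round) ** rounds


if __name__ == "__main__":
    for rounds in range(1, 5):  # hypothetical round counts
        p = blind_guess_probability(9, rounds)
        print(f"{rounds} round(s): {p} (about {float(p):.4%})")
```

Under these assumptions, two rounds leave a blind guesser at 1 in 81, which is still above the 1% success rate mentioned earlier, while three rounds bring it down to 1 in 729. A bot with even mediocre image recognition would of course do much better than blind guessing.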
Captcha The Dog goes one step further and allows you to use your own set of images, which makes it less financially worthwhile for an attacker to break your individual set. There is even a WordPress plugin available, but I have to give a warning: according to the installation page, the plugin requires allow_url_fopen and allow_url_include both to be active. Sounds like trading one evil for another. Remote file inclusion, anyone? Too bad, the idea is great.
3D image rotation
The third approach I'd like to present is proposed by Taylor Hayward and apparently does not have a name. It asks you to identify an object appearing in two rotated 3D renders. You are presented with one control image and a set of nine randomly rotated images, one of which shows the same object as the control image. I found it hard to imagine, so go see the blog - it'll be much clearer.
Once again, the method relies on object recognition. Great idea.
"Do not try this at home"
When you google for "php captcha library" you find literally thousands of homegrown 'solutions' for secure CAPTCHAs. Once again I can only urge you to read Jonathan Wilkins' paper on secure CAPTCHAs before you use one of them. The bottom line is probably that they will not work very well for you, because their authors obfuscate the letters in a way that poses no, or only very limited, problems for OCR software.
Just assume that if Google's solution to CAPTCHA is broken, yours will get broken too once the incentives to try are high enough for the bad guys.
That being said, developing your own approach while sticking to Jonathan's guidelines will most likely be an interesting spare-time project.
Possible benefits
Believe it or not, I really see possible benefits coming out of this arms race. As the spammers' tactics to solve CAPTCHAs improve, the good guys are forced to improve their generation of CAPTCHAs. After some time, the newer CAPTCHAs will also be broken. The cycle continues.
Naturally, the only way for the good guys to verify that a new version of a CAPTCHA is indeed more secure than the old one is to think like a bad guy and try to break their own CAPTCHAs.
This might - and I believe it surely will - lead to better OCR software, better audio recognition and, in general, a higher standard in 'intelligent' algorithms that are able to solve everyday problems.
Audio recognition might help the deaf to read what people say by means other than lip reading. A universal translator (famous from Star Trek) is not completely out of scope either, although it is still really far away.
Text recognition is in high demand on mobile devices. If a program can identify highly distorted characters from a CAPTCHA, I'm sure the same ideas can be applied to reading handwriting.

