Breaking reddit captcha with 96% accuracy

I have written earlier about the vulnerability of captchas and how easy they are to break if we have the generating software to produce data sets for training deep neural networks.

Now we will have a look at breaking reddit's captcha, assuming we don't have access to the generating software. This is an extension of an earlier post, detailed here. Its authors report a success rate of about 10%. We will improve that drastically.

To create the training set, we label only 500 captchas. But due to an implementation issue in reddit's captcha, we can get a large training set automatically. Follow this link to see reddit's captcha. If you reload the captcha with the same URL (make sure you are logged in to reddit), you get a variation of the same captcha under a different transformation. We use this vulnerability both to create a larger training set and to increase our prediction confidence. For example, with a fixed URL we get multiple samples of the captcha JOEPBO.

Segmentation

As we saw in the earlier blog post, segmentation is not necessary, but we have an easy hack for segmenting, which always helps. Here is a sample captcha image:

 

We can remove the background noise because it is at a lower intensity, leaving us with the letters. We can then remove the dots using connected components and segment the image into six separate characters.
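The two steps above (intensity thresholding, then connected components) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code from the repository: the function name `segment_characters`, the threshold of 128, the minimum component size, and the toy image are all made up for the example, and letters are assumed to be brighter than the noise.

```python
from collections import deque

def segment_characters(image, threshold=128, min_pixels=3):
    """Return connected components of bright pixels, ordered left to right.

    image: list of rows of grayscale values (assumption: letters are
    brighter than the background noise). threshold and min_pixels are
    illustrative values, not tuned for real captchas.
    """
    height, width = len(image), len(image[0])
    # Step 1: drop low-intensity background noise.
    mask = [[px >= threshold for px in row] for row in image]
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not mask[y][x] or seen[y][x]:
                continue
            # Step 2: flood-fill one 4-connected component.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < height and 0 <= nx < width and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Step 3: tiny components are treated as dots and discarded.
            if len(pixels) >= min_pixels:
                components.append(pixels)
    # Order the remaining components by their leftmost column.
    components.sort(key=lambda pixels: min(x for _, x in pixels))
    return components

# Toy image: two "letters" (columns 1-2 and 6-7) and one isolated dot (column 4).
toy = [
    [40, 200, 200, 40,  40, 40, 210,  40],
    [40, 200,  40, 40, 220, 40, 210, 210],
    [90, 200,  40, 40,  40, 40, 210,  40],
]
print(len(segment_characters(toy)))
```

On a real captcha, a sample would be kept only if this yields exactly six components of plausible size.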

Character segmentation and training

After we segment a captcha and label it, we can use multiple samples of the same captcha to augment the data set. So if we label 500 captchas and have 100 samples of each, we have 50,000 pairs of images and labels. As our segmentation is pretty noisy, we discard samples where the segmentation is not ideal (more or fewer than 6 character segments, or segments of the wrong size). Approximately 50% of samples are discarded this way.

Training on the remaining segments leads to an individual character accuracy of around 90% (on valid segments), which leads to whole-captcha accuracy of roughly 50 to 60% (0.9^6 ≈ 0.53 if character errors are independent), assuming segmentation is done properly. If segmentation is not accurate, we can still get the correct captcha after seeing several samples of the same text (remember we can get multiple samples of the same captcha with the same URL); how many we need depends on the confidence required. On test cases we get around 75% accuracy at 10 samples of each captcha, 90% at 30 samples, and 96% at 100 samples. The code can be checked at github here.
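One plausible way to combine predictions from several samples of the same captcha is a per-position majority vote. The post does not say exactly how the samples are combined, so treat this as a sketch under that assumption; the function name `vote` and the example predictions are hypothetical.

```python
from collections import Counter

def vote(predictions, length=6):
    """Combine per-sample predictions of one captcha by majority vote.

    predictions: one predicted string per sample of the same captcha.
    Samples whose segmentation did not yield exactly `length` characters
    are discarded first, mirroring the filtering described above.
    Returns None if no valid sample remains.
    """
    valid = [p for p in predictions if len(p) == length]
    if not valid:
        return None
    # zip(*valid) groups the characters predicted for each position.
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*valid)
    )

# Hypothetical predictions for five samples of the same captcha;
# the last one is a bad segmentation and is ignored.
samples = ["JOEPBO", "JOEPB0", "JQEPBO", "JOEPBO", "JOEP"]
print(vote(samples))
```

Each individual prediction may contain an error, but as long as the errors fall on different positions or different samples, the vote recovers the full text. This is why accuracy grows with the number of samples.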

Conclusion

So we see that reddit's captcha has some basic vulnerabilities (removable noise and multiple samples of the same text) which make it easily crackable. Once more we see that captchas are ineffective and only lead to user inconvenience. Accuracy could be improved further with sequence-to-sequence learning, with the segments as the input sequence and the text as the output sequence.

One thought on “Breaking reddit captcha with 96% accuracy”

  1. hi there
    I'm trying to run your github code by simply running `th main.lua`, and I'm getting the error `luajit: cannot open `
    Looking into the `char` module, there is a call `local Y = data.loadY()`, which references the module `data` and this function:
    function data.loadY(file)
      return torch.load(file or 'data/Y.t7')
    end

    This function tries to open the file `data/Y.t7`, which does not exist.
    Is it missing from the repository or do I have to generate it somehow?
    thank you
