I have written earlier about the vulnerability of captchas and how easy they are to break if we have the generating software to produce datasets for training deep neural networks.
Now we will have a look at breaking the reddit captcha, assuming we don’t have access to the generating software. This is an extension of a post detailed here. Its authors report a success rate of about 10%. We will improve on that drastically.
To create the training set, we only label 500 captchas. But due to an implementation issue in the reddit captcha, we can get a large training set automatically. Follow this link to see the reddit captcha. If you reload the captcha with the same url (make sure you are logged in to reddit), you get a variation of the same captcha under a different transformation. We use this vulnerability both to create a larger training set and to enhance our prediction confidence. For example, when using a fixed url we get multiple samples of JOEPBO.
We can remove the background noise, since it is at a lower intensity than the letters, which leaves us with the letters and a few stray dots. We can then remove the dots using connected components and segment the image into six separate characters.
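The cleaning and segmentation steps can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code from the repository: it assumes the captcha is already loaded as a grayscale grid (a list of rows of 0-255 values) in which letters are the high-valued pixels, and the names `threshold`, `connected_components` and `segment_characters`, as well as the `cutoff` and `min_pixels` values, are hypothetical choices made for the example.

```python
from collections import deque

def threshold(image, cutoff=128):
    """Binarize: pixels at or above `cutoff` become 1, fainter noise becomes 0."""
    return [[1 if px >= cutoff else 0 for px in row] for row in image]

def connected_components(mask):
    """Return a list of components, each a list of (row, col) pixels (4-connectivity)."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components

def segment_characters(image, cutoff=128, min_pixels=4, expected=6):
    """Return `expected` boxes (left, top, right, bottom), left to right, or None.

    Components smaller than `min_pixels` are treated as stray dots and dropped.
    None signals a sample whose segmentation should be discarded.
    """
    mask = threshold(image, cutoff)
    blobs = [p for p in connected_components(mask) if len(p) >= min_pixels]
    if len(blobs) != expected:
        return None
    boxes = []
    for pixels in blobs:
        ys = [y for y, _ in pixels]
        xs = [x for _, x in pixels]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)  # tuples sort by their left edge first
```

Returning `None` when the number of components is not six gives a simple way to drop badly segmented samples. Touching or broken letters will defeat this sketch, which is one reason a sizeable share of samples may end up discarded.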
Character segmentation and training
After we segment a captcha and label it, we can use multiple samples of the same captcha to augment the data set. So if we label 500 captchas and have 100 samples of each, we have 50,000 pairs of images and labels. As our segmentation is pretty noisy, we discard samples where the segmentation is not ideal (more or fewer than 6 character segments, or segments of implausible size). Approximately 50% of samples are discarded this way. Training on the remaining segments leads to an individual character accuracy of around 90% (on valid segments), which leads to roughly 50 to 60% accuracy on the whole captcha (0.9^6 ≈ 0.53 if character errors are independent and segmentation is done properly).
If segmentation is not accurate for a given sample, we can still recover the correct captcha after seeing several samples of the same text (remember we can get multiple samples of the same captcha with the same url); how many samples we need depends on the confidence required. We get around 75% accuracy on test cases at 10 samples of each captcha, around 90% at 30 samples, and 96% at 100 samples. The code can be checked at github here.
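One simple way to combine several samples of the same captcha is a per-position majority vote. The sketch below assumes each sample has already been run through the classifier and yields either a six-character string or `None` when segmentation failed. The function names are hypothetical, and per-position voting is an assumption made for illustration, not necessarily the aggregation used in the linked code. A small helper also shows the single-sample estimate under the assumption that character errors are independent.

```python
from collections import Counter

def whole_captcha_accuracy(char_accuracy, length=6):
    """Chance of getting all `length` characters right if errors are independent."""
    return char_accuracy ** length

def vote(predictions, length=6):
    """Majority vote per character position over several reads of one captcha.

    `predictions` holds one entry per sample: a string, or None when the
    sample was discarded. Entries of the wrong length are ignored as well.
    Returns the voted string, or None if no usable prediction exists.
    """
    valid = [p for p in predictions if p is not None and len(p) == length]
    if not valid:
        return None
    voted = []
    for i in range(length):
        counts = Counter(p[i] for p in valid)
        voted.append(counts.most_common(1)[0][0])  # most frequent character
    return "".join(voted)

# Hypothetical reads of the same captcha, some wrong or discarded.
reads = ["JOEPBO", "JOEPB0", None, "JQEPBO", "JOEPBO", "JOEP"]
print(vote(reads))
print(round(whole_captcha_accuracy(0.9), 3))
```

Because different samples tend to fail on different characters, the vote can be correct even when no single read is reliable, which fits the observation that accuracy rises with the number of samples.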
So we see that the reddit captcha has some basic vulnerabilities (removable noise and multiple samples of the same text) which make it easily crackable. Once more we see that captchas are ineffective and only lead to user inconvenience. Accuracy could be improved further with sequence-to-sequence learning, with the segments as the input sequence and the text as the output sequence.