Sequence to sequence learning to decode variable length captchas


In earlier posts we saw how to use CNN and RNN neural networks to decode captchas. Here we will look how to use sequence to sequence learning to transcribe images to captcha sequence. The image is not a sequence per se, but this example helps in demonstrate how to use seq to seq learning. Sequence to sequence learning has been used in machine translation, question answering and chat bots.

The sequence to sequence learning is based on using RNN (encoder) to encode a sequence of variable length to fixed length vector. Then we use another RNN (decoder) to decode the fixed length vector to the target sequence. The vector should be large enough to capture the various sequences possible. This solves the problem of having different length inputs and outputs.

First lets see the encoder. This is an LSTM which takes a sequence of inputs from net. Then breaks in into tables to feed to the LSTM sequencer. The output of the last LSTM is taken to feed into the decoder.

local enc = nn.Sequential()
 enc:add(nn.Sequencer(nn.LSTM(hsize, hsize)))

The decoder takes the output of the encoder and feeds it into another LSTM seqeuencer. Depending on the length of the output.

 local dec = nn.Sequential()
                 :add(nn.LSTM(hsize, hsize))
                 :add(nn.Linear(hsize, csize))

Then we use LogSoftMax with CLassNLLCriterion to calculate loss and backprop.

 local enc = seq.enc(net,hsize)
 local dec = seq.dec(hsize,csize)
 local encdec = nn.Sequential()
 local criterion = nn.SequencerCriterion(

Now we see how to use this network to decode variable length captchas. If the length of the captcha is not known before hand, we have to use RNN and plain CNN may not work. As CNN only works on fixed length outputs. For example the captchas below have variable length with spaces between them.


We use CNN with encoder to encode the captchas to a fixed length vector. Then we use decoder to decode the captchas. With this we achieve 50% accuracy on variable length captchas. The whole code can be checked on github. We will see how to use attention mechanism to increase the accuracy in later post.

Recurrent neural networks for decoding CAPTCHAS

Captcha breaking of simplecaptcha with 96% accuracy using RNNs.

Convolutional neural networks make assumptions regarding the size of input. But Recurrent Neural Networks (RNN) can work on variable sized input or output. Here we look how to use RNN in Torch to decode captchas. This will help in getting dependency between (adjacent) characters and can also work on variable length captchas (which we will see in a later post).

For a good overview of RNNs read this blog post by Karpathy. Please read that post before proceeding.

As we saw in an earlier post, using CNN we were able to acheive 92% accuracy. Here we build a RNN on top of the CNN which will lead to better accuracy as we will be able to model dependency of adjacent characters in the image.

Suppose we define the CNN net which take an image and gives as output 5 predictions of 36 classes each which is a 5×36 dim output. Then we split the output of the CNN to 5 parts such that each one goes to the RNN module, as follows.


Our RNN has 5 steps as the captcha size is fice letters. At each step our CNN net propogates a different part of its output (of 5 parts). Our RNN works on the hidden state and CNN output to output the final predictions. This helps in taking into consideration the earlier context to predict the current character more accurately. We will be using the Sequencer class from rnn package to achieve this (as follows)


local mlp = nn.Sequential()
      hiddenSize, nn.Identity(),
          :add(nn.Linear(hiddenSize, hiddenSize))
          :add(nn.ReLU()), nn.ReLU(),
     :add(nn.Linear(hiddenSize, classes))

 local rnn = nn.Sequential()

Here we see how to use SequenceCriterion to calculate the loss and backpropogate the gradients. This criterion takes a table of outputs and table of targets of size of steps of RNN and calculates the loss. We use tnet below to split our training Y to feed to criterion.

 local ct = nn.SequencerCriterion(
 local tnet = nn.SplitTable(2,2)
 local targets = tnet:forward(Yb)
 local outputs = rnn:forward(inputs)
 loss = loss + ct:forward(outputs, targets)

The code for the RNN captcha is here on github. Run rnnMain.lua to acheive accuracy of 96% on the captchas. We will look in later post how to decode captchas which have variable length using RNNs (sequence to sequence learning)

Residual networks in torch (MNIST 100 layers)

Residual networks architecture is a type of architecture introduced by MSR(Microsoft Research) which helped them win the Imagenet competition 2015 ahead of Google which won last year. Using residual networks, they were able to train very deep neural networks as deep as 150 layers much more then previously trained and got better solutions thanks to its architecture. Here we will have a look at the relatively simple change to the traditional deep neural networks, and which can be easily added to your existing networks with out changing the architecture radically.

The main problem faced by very deep neural networks is that gradients have a hard time being propagated through all the layers which is a well known problem. Many solutions have been extended previously like highway networks, gradient clipping in RNN and even batch normalization which ameliorates  gradient vanishing or gradient explosion among others here and here.

Suppose you have a layer L, which takes as input x and gives output  y = L(x). If the layer has to learn a function y = 0, then L need not learn much and have all weights as zero. If the layer has to learn a function y = x, it has a much harder time to learn the identity function and have to learn its weights according. Here we will start the layer as the Identity transformation. We will see how this helps in learning very deep networks.

One thing deep learning says is the more layers we add the better the accuracy of prediction. But its not so simple. We see performance peaking at some depth, but the more layers we add after that the performance actually decreases (both the training error as well as validation error). But the (smaller) best performing net can be transformed into a deeper net by adding identity layers on the smaller net. Hence if the layers are identity transformation by default, then the deeper nets would not have a problem to reach to the solution of smaller nets.

One way to do this is if a layers transformation is y = F(x), define it as y = F(x) + x. Here we assume that the dimensions of y and x are same. Remember that this is a layer transformation which can be nonlinear. See the following image taken from the paper.

Screen Shot 2016-01-05 at 5.57.43 pm

If the dimensions of input and output are different in a layer, y = F(x) i.e dim(x) not equal to dim(y), add a simple linear transform, y = F(x) + W * x.

This is similar to highway networks in a sense, the first layer gradients don’t explode or vanish.


Here as we are using MNIST, so lets focus on convolutional layers only. This is can be extended to others easily. We use 3×3 filters with stride 1 and pad 1 mostly as the convolutional layer. This ensures that the dimensions of the input is same as the output of that particular  convolutional layer.

nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)

Instead of max pooling layers we will use 3×3 convolutions with stride 2 and no padding. This leads to filters of size half in width and height. They also increase double the number of filters. This keeps the computation at each layer the same.

nn.SpatialConvolution(64 -> 128, 3x3, 2,2, 1,1)

So a plain (non-residual) network may look like:

 net = nn.Sequential()
 net:add(nn.SpatialConvolution(128,128, 3,3,1,1,))

Now we will see how to create a residual network from this plain network.

First, suppose the layer L has equal input(x) and output(y) dimensions. The the transformation is simple. Instead of y = L(x), we will have y = L(x) + x. This can be easily done by CAddTable module. Also a layer is taken as two consecutive convolutional layers as this creates better nonlinearities.

Here the unit represents the layer L

local cat = nn.ConcatTable()
local net = net or nn.Sequential()

Normally, to reduce the dimensions of a image we use MaxPooling. The paper uses convolutional layers with stride 2. They also increase the number of filters by doubling it. This keeps the computation at each layer the same. As the dimensions are changing at this layer, we will use y = L(y) + W * y. Here W is implemented by a convolutional layer with filters 1 and stride 2.

A unit here is convolutional layer with stride 2 and double filters

 local cat = nn.ConcatTable()
 local net = net or nn.Sequential()

So a residual network looks like:

  (1): nn.Reshape(1x28x28)
  (2): nn.SpatialConvolution(1 -> 64, 3x3, 1,1, 1,1)
  (3): nn.SpatialBatchNormalization
  (4): nn.ReLU
  (5): nn.ConcatTable {
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> output]
      |      (1): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
      |      (2): nn.SpatialBatchNormalization
      |      (3): nn.ReLU
      |      (4): nn.SpatialConvolution(64 -> 64, 3x3, 1,1, 1,1)
      |      (5): nn.SpatialBatchNormalization
      |      (6): nn.ReLU
      |    }
      |`-> (2): nn.Identity
       ... -> output
  (6): nn.CAddTable
  (7): nn.ReLU
  (8): nn.ConcatTable {
      |`-> (1): nn.Sequential {
      |      [input -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> output]
      |      (1): nn.SpatialConvolution(64 -> 128, 3x3, 2,2, 1,1)
      |      (2): nn.SpatialBatchNormalization
      |      (3): nn.ReLU
      |      (4): nn.SpatialConvolution(128 -> 128, 3x3, 1,1, 1,1)
      |      (5): nn.SpatialBatchNormalization
      |      (6): nn.ReLU
      |    }
      |`-> (2): nn.SpatialConvolution(64 -> 128, 1x1, 2,2)
       ... -> output
  (9): nn.CAddTable
  (10): nn.ReLU


Using this we train  plain and residual networks on MNIST. As can be seen, the same depth of network, residual networks have better performance. We were also able to train 100 layer deep networks with no problem.

MNIST plain 34 layer net – 99.16%

MNIST residual 34 layer net – 99.23%

MNIST plain 100 layer net – 99.49%

Code is on github here.

They have also removed the final weight layer instead using global average pooling. I have not implemented that way instead using linear layer after max pooling. Using data augmentation and dropout may lead to better results. But here we only demonstrate how to train very deep networks.

Breaking reddit captcha with 96% accuracy

I have  written earlier about the vulnerability of captchas and how easy it is to break if we have the generating software to generate datasets for training deep neural networks.

No we will have a look at breaking reddit captcha assuming we don’t have access to generating software. This is an extension of a post detailing here. They say they have a success of about 10%. We will improve that drastically.

To create the training set, we only label 500 captchas. But due to an implementation issue of reddit captcha, we can get a large training set automatically. Follow this link to see the captcha of reddit. If you reload the captcha with the same url (make sure you are logged in reddit), you get a variation of the same captcha under a different transformation. We use this vulnerability to both create a larger training set and enhance our prediction confidence. For example, when using fixed url we get multiple samples of JOEPBO

Screen Shot 2016-01-05 at 10.40.11 am.png


As we see in earlier blogpost, segmentation is not necessary but we have an easy hack to segment which always helps. Here is an sample captcha image,


We can remove the background noise as they are at lower intensity, leaving us with the letters, we can further remove the dots by using connected components and segment into six seperate characters.

download (2).png


Character segmentation and training

Screen Shot 2016-01-05 at 1.13.19 pm

After we segment a captcha and label it, we can use multiple samples of the same captcha to augment the data set. So if we label 500 captchas, and have 100 samples of each captcha, we have 50000 pairs of images and labels. As our segmentation is pretty noisy, we can remove some samples where segmentation is not ideal(more than or less than 6 segments of characters or size of each segment). Approximately 50% of samples are discarded this way. Training on these segments this way leads to individual character accuracy of around 90% (of valid segments), which leads to around 60%(0.85^6) accuracy of the whole captcha (assuming segmentation is done properly). If segmentation is not accurate, we can get the correct captcha after seeing several captchas of the same text (remember we can get multiple samples of the same captcha with the url), depending on the confidence required. We get around 90% accuracy on test cases at 30 samples of each captcha. We also get around 75% at 10 samples and 96% at 100 samples. The code can be checked at github here.


So we see that reddit captcha has some basic vulnerabilities (removable noise and multiple samples) which makes it easily crackable. Once more we see captchas are ineffective and only leads to user inconvenience. Accuracy can be improved with sequence to sequence learning with first sequence as segments and text as the other.

Using deep learning to break a Captcha system

Using Torch code to break simplecaptcha with 92% accuracy

Captcha is used as a common tactic to stop bots from entering a website. Any visitor to a website is presented with a text image containing  some letters which can be read by a human and not by automated bots. They are quite frequently used to stop automated password hacking or automated login to websites etc. The following is taken from the wikipedia page[1] of captcha. As captchas are usually used to deter software programs they can be usually very hard to read and human accuracy can be around 93% [2]. It also takes something like 10 secs to read a captcha. As can be seen this takes quite a toll on the user experience.

Breaking Captchas

There are a few approaches to defeating CAPTCHAs: using cheap human labor to recognize them, exploiting bugs in the implementation that allow the attacker to completely bypass the CAPTCHA, and finally improving character recognition software.  There are numerous criminal services which solve CAPTCHAs automatically.

Neural networks have been used with great success to defeat CAPTCHAs as they are generally indifferent to both affine and non-linear transformations. As they learn by example rather than through explicit coding, with appropriate tools very limited technical knowledge is required to defeat more complex CAPTCHAs.

The only part where humans still outperform computers is Segmentation, i.e., splitting the image into segments containing a single letter. If the background clutter consists of shapes similar to letter shapes, and the letters are connected by this clutter, the segmentation becomes nearly impossible with current software. Hence, an effective CAPTCHA should focus on the segmentation.

We will show that with the increase of computer power(GPU) and advanced neural networks (deep learning), most of the popular captcha softwares can be broken. This has also been in news quite recently [3][4]. Here we demonstrate using a captcha software available for common public usage. This is just to show that most common captcha software are susceptible to this attack and should refrain from using captcha as a system to stop bots.

Deep Machine Learning

Machine learning can be seen as trying to find a function given examples of input and output of that function. So if y=f(x), here x can be the captcha image and y can be text to be read from the image. We are trying to fit a function f using machine learning to read text from the image. The great thing about machine learning is that if we provide the pairs of image and text, the ml is generic enough and tries to figure out the underlying function. Here we use deep neural networks as out machine learning algorithm. So the two main steps are to generate the pairs of images and text which is one of the main tasks in machine learning applications. But we can easily generate the text and image pairs by using the software used by the website. The following excerpt is taken from wikipedia ..

 … the algorithm used to create the CAPTCHA must be made public, though it may be covered by a patent. This is done to demonstrate that breaking it requires the solution to a difficult problem in the field of artificial intelligence (AI) rather than just the discovery of the (secret) algorithm, which could be obtained through reverse engineering or other means.


Here we will try to break Simple Captcha which is a java based captcha software. Some examples of captchas generated are

Screen Shot 2016-01-03 at 11.50.57 pm

This library we picked randomly and we can see that the images are quite difficult even for humans. To create the dataset  we use the software itself. I have created a dataset of 10000 samples of 5 characters each using all the effects and noises available in the library to make it harder.


We use torch to train the neural network and VGG based deep neural network architecture. We train on a GPU Nvidia 780 Titan. You can check similar set up for CIFAR dataset at this blog post. The main difference is the criterion, as the output of a image is a sequence of characters which begs us to use RNN. But that is also not necessary here as we use our custom MultiCrossEntropyCriterion and is sufficient to get good enough results. I have committed the code to github repository here. We assume the dataset to be in a folder (named dataset). The shown image size is 50×100. The network can be changed accordingly to fit any other captcha. Be careful to also change the number of classes and text size in file model.lua.

First create the dataset in a folder in the required format as shown. Then edit the parameters in captcha.lua to suit your needs:

  • set validation data size
  • set batch size to fit the training of network in your GPU
  • set the number of iterations to run
  • set learning rate, learning rate decay and momentum

I have not included the training data or code to generate the dataset, but should be simple enough to do. Using this setup and training for a couple of hours I was able to get 92% accuracy on the dataset.


So without much effort and using standard deep learning library and generating datasets automatically and using huge computational power, we can break standard captcha systems in a days work. This shows that reliance on captchas for stopping bots should be stopped. Making captchas harder only makes it harder for humans and not this attack. Using RNN with attention can improve results further. And using custom captcha generator only makes it slightly difficult in generating dataset which can be circumvented by humans (mechanical turk).


So what are the alternatives to captchas. Use recaptcha by google. They have realized that captchas are susceptible to attacks and hence moved away from them. They used advanced machine learning techniques to monitor bot traffic and stop them.


  1. wikipedia
  2. human accuracy
  3. google captcha breaking
  4. vicarious
  5. paper