Handwriting recognition

Created: June 23, 2016 / Updated: September 19, 2017 / Status: in progress / 12 min read (~2244 words)

Handwriting recognition was one of the first tasks to interest machine learning and AI researchers. The initial goal was to rapidly process IRS forms and convert them into their digital equivalent. This meant that a large amount of handwritten content was available; all that was missing was to label it and then develop tools to convert character images into their ASCII equivalent.

  • BCCS: BiConnected/Binary Connected Components

  • Can handwriting recognition be taught through the process of teaching a network how to write?

  • How to detect characters?
    • How to support multi-scale character recognition?
  • How to group characters together to form words/large numbers?
  • How do you group together words in order to form lines?
  • How to improve word recognition accuracy using a vocabulary?
  • How do you properly classify characters?
    • Given that your average MNIST neural network is trained on 28x28 images with 256 gray values, the input space to cover contains $256^{784}$ possible images
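
As a rough sanity check on the size of that input space, assuming 28 x 28 = 784 pixels that can each take one of 256 gray values:

$$256^{784} = \left(2^{8}\right)^{784} = 2^{6272} \approx 10^{1888}$$

The data sets used in the experiments further down contain on the order of $10^5$ labeled samples, so any classifier has to generalize from a vanishingly small fraction of that space.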

  • Content is written on white sheets of paper
  • Paper might have lines (loose leaf sheet)
  • Paper may have various formats (generally 8.5"x11")
  • Glyphs are generally written using a color that has a large contrast with the sheet/background color
  • Text may be blurry due to the capture device
  • Image size is expected to vary (due to the capture device)
  • PPI may vary
  • Language may vary and many languages may be used within the same page
  • Page may contain one or more images/drawings

  • Preprocess image
    • RGB to gray (0-255) or black and white (0-1)
    • Canny edge detection
    • Noise removal (Gaussian, salt and pepper)
  • Line extraction
    • Cropping of the text region
    • Vertical scan to find blank rows => line separator
  • Letter extraction
    • Horizontal scan to find blank columns => character separator
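
The line and letter extraction steps above can be sketched with projection profiles. This is a minimal illustration, assuming the page has already been binarized (1 = ink, 0 = background) and cropped; the function names and the tiny `page` array are made up for the example.

```python
def non_blank_segments(is_blank):
    """Return (start, end) pairs covering each run of non-blank entries."""
    segments, start = [], None
    for i, blank in enumerate(is_blank):
        if not blank and start is None:
            start = i                      # a run of ink begins
        elif blank and start is not None:
            segments.append((start, i))    # a blank entry closes the run
            start = None
    if start is not None:
        segments.append((start, len(is_blank)))
    return segments


def extract_lines(image):
    """Vertical scan: blank rows separate text lines."""
    row_is_blank = [not any(row) for row in image]
    return non_blank_segments(row_is_blank)


def extract_characters(image, top, bottom):
    """Horizontal scan of one line: blank columns separate characters."""
    width = len(image[0])
    col_is_blank = [
        not any(image[y][x] for y in range(top, bottom)) for x in range(width)
    ]
    return non_blank_segments(col_is_blank)


# Hypothetical 6x7 binary page holding two lines of two "characters" each.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]

lines = extract_lines(page)
characters = [extract_characters(page, top, bottom) for top, bottom in lines]
```

This only works when lines and characters are separated by fully blank rows or columns; slanted lines, ruled paper, or touching (cursive) characters would need something more robust.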

  • A memory model that remembers where we are currently looking (either through landmarks or some form of (x, y) coordinates)
  • A reading (attention) model that knows how to scan pages (depending on the language)
  • A character recognition model

DNN: 2 dense layers of 512 units with relu activation, with a dropout layer of ratio 0.2, softmax activation on the output layer, optimizing categorical cross-entropy

Based off https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py.
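
To make the size of this network concrete, here is a small sketch that counts its trainable parameters. It assumes a flattened 28x28 input (784 values), two hidden layers of 512 units, and one output unit per character class; the helper name and the class counts are illustrative, not taken from the experiments.

```python
def dense_parameter_count(input_size, hidden_sizes, num_classes):
    """Count the weights and biases of a fully connected network.

    Dropout layers are ignored because they have no trainable parameters.
    """
    total = 0
    previous = input_size
    for units in list(hidden_sizes) + [num_classes]:
        total += previous * units + units  # weight matrix + bias vector
        previous = units
    return total


# Assumed class counts: 10 digits, 26 letters per case, 62 for everything.
class_counts = {
    "digits": 10,
    "lowercase": 26,
    "uppercase": 26,
    "letters": 52,
    "all": 62,
}
parameter_counts = {
    name: dense_parameter_count(28 * 28, [512, 512], classes)
    for name, classes in class_counts.items()
}
```

Under these assumptions the digit classifier has roughly 670,000 parameters, and only the output layer grows with the number of classes.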

CNN: 2 convolutional layers of kernel size (3, 3) with relu activation, then max pooling (2, 2), 0.25 dropout, flatten, dense 128 with relu activation, 0.5 dropout and finally a softmax output layer, optimizing categorical cross-entropy

Based off https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py.
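
The following sketch traces how the feature map shrinks through this network and counts its trainable parameters. The number of filters per convolutional layer (32 and 64), the stride of 1, the absence of padding, and the 28x28 single-channel input are assumptions made for the example; the text above does not state them.

```python
def conv_output_size(size, kernel):
    """Spatial size after a stride-1 convolution without padding."""
    return size - kernel + 1


def cnn_parameter_count(image_size=28, channels=1, filters=(32, 64),
                        kernel=3, pool=2, dense_units=128, num_classes=10):
    """Return (flattened feature size, number of trainable parameters)."""
    total = 0
    size, depth = image_size, channels
    for count in filters:
        total += kernel * kernel * depth * count + count  # kernels + biases
        size, depth = conv_output_size(size, kernel), count
    size //= pool  # max pooling halves each side and has no parameters
    flattened = size * size * depth
    total += flattened * dense_units + dense_units    # dense 128
    total += dense_units * num_classes + num_classes  # softmax output
    return flattened, total


flattened_size, parameter_count = cnn_parameter_count()
```

With these assumed values, most of the parameters sit in the dense layer that follows the flatten step rather than in the convolutions.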

Test set size is 25% of the training set size (thus 20% of the total data set size). The training and test sets are fixed.
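
Writing $n_{train}$ and $n_{test}$ for the two set sizes, the 20% figure follows directly from the 25% ratio:

$$n_{test} = 0.25\,n_{train} \quad\Rightarrow\quad \frac{n_{test}}{n_{train} + n_{test}} = \frac{0.25}{1.25} = 0.2$$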

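| Network | Character classes | Training set size | Training duration per epoch (s) | Test loss | Test accuracy |
|---------|-------------------|-------------------|---------------------------------|-----------|---------------|
| DNN     | Digits            | 39990             | 3                               | 0.103     | 0.973         |
| DNN     | Lowercase letters | 103974            | 8                               | 0.295     | 0.905         |
| DNN     | Uppercase letters | 103974            | 9                               | 0.238     | 0.936         |
| DNN     | Letters           | 207948            | 24                              | 0.700     | 0.712         |
| DNN     | All               | 247938            | 32                              | 0.794     | 0.688         |
| CNN     | Digits            |                   |                                 |           |               |
| CNN     | Lowercase letters |                   |                                 |           |               |
| CNN     | Uppercase letters |                   |                                 |           |               |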

  • To evaluate what should be improved/worked on, note that a sequential pipeline is affected most by its earliest components: errors made early on accumulate and propagate to the later stages
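
One simple way to reason about this is to model the pipeline as a chain in which every stage must succeed and no stage can repair an upstream mistake. The sketch below assumes independent errors, and the per-stage accuracies are hypothetical numbers chosen for illustration, not measurements.

```python
def pipeline_accuracy(stage_accuracies):
    """End-to-end accuracy when all stages must succeed independently."""
    result = 1.0
    for accuracy in stage_accuracies:
        result *= accuracy
    return result


# Hypothetical stages: line extraction, letter extraction, classification.
baseline = pipeline_accuracy([0.90, 0.95, 0.97])

# Even a perfect classifier cannot exceed what the earlier stages deliver.
upper_bound = pipeline_accuracy([0.90, 0.95, 1.00])
```

Under this model the extraction stages cap the result at 0.855, so work on the classifier alone can only close the gap between the baseline (about 0.829) and that cap.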

  • Parents read to them while pointing at the part of the text they are reading
  • They point to objects they recognize and link the word to the object
  • An association between known words, their phonetics and how they are written is built up over time

See 1.

  • Pre-alphabetic
  • Partial alphabetic
  • Full alphabetic
  • Consolidated alphabetic

  • Multi-scale character detection via sliding window classification
  • Use of Random Ferns due to the large number of categories (62 = 26 uppercase + 26 lowercase + 10 digits)
    • Naturally multi-class and efficient both to train and test
  • The features consist of applying randomly chosen thresholds on randomly chosen entries in a HOG descriptor computed at the window location
  • Application of non-maximal suppression
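
Two of the steps above can be sketched in a few lines: the fern feature (random thresholds on random descriptor entries) and non-maximal suppression. This is an illustrative sketch only. It assumes the HOG descriptor is already available as a list of floats in [0, 1], it omits the per-class histograms a fern classifier would learn over the indices, and it uses greedy suppression based on intersection over union, which is one common variant and not necessarily the one meant here. All names are made up for the example.

```python
import random


def make_fern(descriptor_length, depth, rng):
    """Pick `depth` random (entry index, threshold) pairs."""
    return [(rng.randrange(descriptor_length), rng.random())
            for _ in range(depth)]


def fern_index(fern, descriptor):
    """Pack the binary test outcomes into one integer in [0, 2**depth)."""
    index = 0
    for entry, threshold in fern:
        index = (index << 1) | int(descriptor[entry] > threshold)
    return index


def overlap(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = iw * ih
    union = aw * ah + bw * bh - intersection
    return intersection / union if union else 0.0


def non_maximum_suppression(detections, max_overlap=0.5):
    """Greedily keep the best-scoring box of each overlapping group."""
    kept = []
    for score, box in sorted(detections, reverse=True):
        if all(overlap(box, other) <= max_overlap for _, other in kept):
            kept.append((score, box))
    return kept


# A hand-written fern and descriptor, so the result can be checked by eye.
example_index = fern_index([(0, 0.5), (2, 0.5)], [0.9, 0.1, 0.2])

# A randomly drawn fern, as the technique prescribes.
random_fern = make_fern(descriptor_length=3, depth=4, rng=random.Random(0))
random_index = fern_index(random_fern, [0.9, 0.1, 0.2])

# Hypothetical sliding-window detections: (score, (x, y, width, height)).
detections = [
    (0.9, (10, 10, 20, 20)),
    (0.8, (12, 11, 20, 20)),
    (0.7, (50, 10, 20, 20)),
]
kept = non_maximum_suppression(detections)
```

In the hand-written case the first test passes (0.9 > 0.5) and the second fails (0.2 > 0.5 is false), giving the binary code 10. The second detection overlaps the first too heavily and is suppressed, while the third is kept.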

  • Glyphs carry a continuity prior: given a pixel that belongs to a glyph, its neighbors in every direction are likely to belong to the glyph as well, whereas a neighbor that is too different is likely not part of the glyph
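
A minimal way to exploit this continuity idea is region growing: start from a pixel known to be ink and keep adding 8-connected neighbors whose gray value is close enough. The function name, the `max_difference` threshold, and the tiny image below are assumptions made for the sketch.

```python
from collections import deque


def grow_glyph(image, seed, max_difference=40):
    """Collect pixels reachable from `seed` through similar-valued neighbors."""
    height, width = len(image), len(image[0])
    glyph = {seed}
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (ny, nx) in glyph:
                    continue  # already collected (also skips the pixel itself)
                if not (0 <= ny < height and 0 <= nx < width):
                    continue  # outside the image
                # Continuity: accept the neighbor only if it looks similar.
                if abs(image[ny][nx] - image[y][x]) <= max_difference:
                    glyph.add((ny, nx))
                    queue.append((ny, nx))
    return glyph


# Hypothetical gray-scale patch: dark ink (low values) on a white background.
patch = [
    [255, 255, 255, 255],
    [255,  20,  30, 255],
    [255, 255,  25, 255],
    [255, 255, 255, 255],
]
glyph_pixels = grow_glyph(patch, seed=(1, 1))
```

Because each pixel is compared with its immediate neighbor rather than with the seed, the region can drift across smooth gradients; comparing against the seed value would be a stricter alternative.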