Tesseract: Open-source OCR library for Java
Weeks ago I was given a task to read values from an e-commerce website. The idea was simple: a link was given, the application should parse the content of the HTML, download the specific value and store it. I decided to use a crawler instead, but this is another story. That’s said, the task was indeed simple, except for the fact that values were images instead of text.
So, I started my research on Optical Character Recognition (OCR) with two things in mind:
- I should not re-invent the wheel. Probably, there are couple of efficient libraries out there for this purpose. I would save time, and time is money, right?
- It must be open-source. Not because it really must, but because I would like it to be.
Making the story short, my research ended up with tesseract-ocr. This OCR engine fulfills the criteria above, its usage is straightforward and, finally, it has been improved by Google (if you are a developer, you know, there is a status on it).
In few lines, here is the basic usage:
To start with, ImageIO.read(url)
simply reads the image from the URL.
Next, there is dumb validation because I knew that even the lowest values wouldn’t fit in the dimension of 20x20 pixels.
While trying the solution with few examples, I realized that the images I was handling were too small for tesseract.
I know that scaling the images wouldn’t improve their quality, because the resolution itself was poor. However, with SCALE = 2
(for my specific case), I could place the image size in a range that tesseract is able to scan.
Notice that scaledPriceImg
at that point is just a “frame” twice the size of my original image.
Afterwards, I used a flag USE_GRAPHICS
to be able to switch between two different scaling conversion methods.
The first one is using Graphics2D
. The visual result of this scale is satisfactory, but a bit blurry for OCR tools.
The second method uses AffineTransform
. This utility provides three algorithms for scaling, and here it’s denoted by my global variable AFFINE_TRANSFORMATION_TYPE
.
The candidates are AffineTransformOp.TYPE_BICUBIC
, ...TYPE_BILINEAR
and ...TYPE_NEAREST_NEIGHBOR
.
I choose the bicubic transformation with AffineTransform
because it was more accurate for the figures I wanted to scan. This decision depends pretty much on your application.
Finally, I’m invoking the OCR tool itself Tesseract.getInstance().doOCR(scaledPriceImg)
, followed by a formatting and a conversion to double.
After running the application for over 500 images, I’ve got an accuracy of around 95%.
The usage of Tesseract is really straightforward, but I realized that the pre-processing of images was the most relevant issue, with heavy impact on my results.
The OCR module for my specific scenario can be found here.