
I often work with scanned papers. The papers contain tables (similar to Excel tables) that I need to type into the computer manually. To make matters worse, the tables can have different numbers of columns. Manually entering them into Excel is mundane, to say the least.

I thought I could save myself a week of work if I could write a program to OCR them. Would it be possible to detect the header text areas with OpenCV and then OCR the text at the detected image coordinates?

Can I achieve this with the help of OpenCV, or do I need an entirely different approach?

Edit: The example table is really just a standard table, similar to what you can see in Excel and other spreadsheet applications; see below.

[image: example table]

Datageek
  • Yes, you can. But it'll be hard to get 100% perfect results, unless you have well defined constraints. Can you show some of your scanned tables? – Miki Oct 31 '15 at 14:43
  • Can you please provide one or two scanned documents? The quality of the scan has a large impact on the final result. – Miki Nov 02 '15 at 11:40
  • The problem has 2 parts: 1. recognizing and extracting the table, 2. OCR. The first part is relatively easy, and you can find tutorials like http://www.shogun-toolbox.org/static/notebook/current/Sudoku_recognizer.html. OCR is relatively tougher: from my experience, it works reliably enough to need very little human intervention only with high-quality scans or images of printed fonts. Making the OCR engine is possible in many ways, from SVM to deep learning. You can find tutorials that suit your expertise. – Karan Dwivedi Jan 02 '16 at 10:05
  • @Datageek, I am also working on similar stuff. Could you share your experience if you have managed to convert a row into data when each cell has multiple words? – explorer Nov 17 '18 at 03:58

1 Answer


This question seems a little old, but I was also working on a similar problem and came up with my own solution, which I explain here.

When reading text with any OCR engine, there are many challenges to getting good accuracy, including the following main cases:

  1. Presence of noise due to poor image quality or unwanted elements/blobs in the background region. This requires some pre-processing such as noise removal, which can easily be done using a Gaussian filter or a standard median filter; both are available in OpenCV (see the short sketch after this list).

  2. Wrong orientation of the image: because of a wrong orientation, the OCR engine fails to segment the lines and words in the image correctly, which gives the worst accuracy.

  3. Presence of lines: while doing word or line segmentation, the OCR engine sometimes tries to merge words and lines together, thus processing the wrong content and giving wrong results. There are other issues as well, but these are the basic ones.
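
A minimal denoising sketch for point 1, using OpenCV (the file name is only a placeholder, and the kernel sizes would need tuning for real scans):

```python
import cv2

# Load the scanned page in grayscale
img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)

# Gaussian filter smooths general sensor noise
smoothed = cv2.GaussianBlur(img, (3, 3), 0)

# Median filter is better suited to salt-and-pepper speckle from scanning
denoised = cv2.medianBlur(smoothed, 3)
```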

In this case I think the scanned image quality is quite good and simple, and the following steps can be used to solve the problem.

  1. Simple image binarization will remove the background content, leaving only the necessary content, as shown here: [binary image]
  2. Now we have to remove the lines, which in this case form the tabular grid. The grid can be identified using connected components, and the large connected components can be removed. The final image that needs to be fed to the OCR engine will then look like this:

    [image: binarized table with grid lines removed]

  3. For OCR we can use the Tesseract open-source OCR engine. I got the following results from OCR:

    Caption title

    header! header2 header3

    row1cell1 row1cell2 row1cell3

    row2cell1 row2cell2 row2cell3

  4. As we can see, the result is quite accurate, but there are some issues, such as header! which should be header1; this is because the OCR engine confused 1 with !. This problem can be solved by further processing the result using regex-based operations (a combined sketch of steps 1-4 follows this list).
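
A rough end-to-end sketch of steps 1-4 with OpenCV, pytesseract and Python's re module (the size threshold for grid components and the header fix-up pattern are illustrative assumptions, not values from the original workflow):

```python
import re

import cv2
import pytesseract

# Step 1: binarize with Otsu's threshold, inverted so the ink becomes the
# white foreground for the connected-component analysis
gray = cv2.imread("table_scan.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Step 2: remove the tabular grid by deleting very large connected
# components (grid lines span most of the page, text blobs do not)
num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
h, w = binary.shape
cleaned = binary.copy()
for label in range(1, num_labels):  # label 0 is the background
    if (stats[label, cv2.CC_STAT_WIDTH] > 0.5 * w
            or stats[label, cv2.CC_STAT_HEIGHT] > 0.5 * h):
        cleaned[labels == label] = 0  # paint grid pixels back to background

# Step 3: OCR with Tesseract (re-invert so it sees dark text on white)
text = pytesseract.image_to_string(cv2.bitwise_not(cleaned))

# Step 4: illustrative regex clean-up, e.g. a '!' misread at the end of an
# alphanumeric token such as "header!" is mapped back to '1'
text = re.sub(r"(?<=[A-Za-z])!(?=\s|$)", "1", text)
print(text)
```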

After post-processing, the OCR result can be parsed to read the row and column values (a minimal parsing sketch is shown below).

Also, in this case, font information can be used to classify the sheet title, headings and normal cell values.
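
A minimal parsing sketch, assuming single-word cells separated by whitespace (multi-word cells need the coordinate-based approach discussed in the comments below):

```python
def parse_table(ocr_text):
    """Split plain-text OCR output into rows of whitespace-separated cells."""
    rows = []
    for line in ocr_text.splitlines():
        cells = line.split()
        if cells:  # skip blank lines between rows
            rows.append(cells)
    return rows

# parse_table("header1 header2 header3\nrow1cell1 row1cell2 row1cell3")
# -> [['header1', 'header2', 'header3'], ['row1cell1', 'row1cell2', 'row1cell3']]
```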

flamelite
  • Thanks for the detailed answer @flamelite. Do you know of any open-source code that can do this? Did you publish your solution perhaps? – Datageek Nov 04 '17 at 09:36
  • I am not sure of any specific open-source application which does all the steps mentioned. I did not develop any application which does all the above things in one button click. – flamelite Nov 04 '17 at 10:23
  • Can you explain how you performed the second step, for removing the table? – Mooncrater May 28 '18 at 11:26
  • You can get the list of all pixels in a connected component as described here https://docs.opencv.org/3.1.0/d3/dc0/group__imgproc__shape.html#gae57b028a2b2ca327227c2399a9d53241 and then set those pixels' color to the background color. – flamelite May 28 '18 at 11:36
  • @flamelite, thanks for sharing your experience. I am working on similar stuff. Did you also handle scenarios where each cell has multiple words, which makes figuring out the cell content impossible, for example when a row with 8 columns has 10 words? If so, could you please share how you solved it? – explorer Nov 17 '18 at 04:01
  • @explorer I did not consider your scenario, but the above approach should work even if a single cell contains multiple words. – flamelite Nov 17 '18 at 10:58
  • @flamelite I understand that the current approach will work for multiple words. But how do I know which word belongs to which cell? For example, to represent a 4-column table in JSON array format ```[{"col1": "data", "col2":"data","col3":"data", "col4":"data" }]```, if each column has one word, I can simply split by space, as I will get an array of 4 words, which is the same as the number of columns. If there is more than one word in one or more columns, splitting by space will result in an array longer than the number of columns. Are you able to see any solution for this? – explorer Nov 19 '18 at 05:57
  • I understand your problem; it can be solved in many ways. One way would be to first define each cell's bounding box using the corner points obtained from the horizontal and vertical lines in the table. Using an OCR library like Tesseract, you can get the coordinate location of each word; I believe this information would be sufficient to cluster multiple words of the same cell (see the sketch after these comments). – flamelite Nov 19 '18 at 06:34
  • @flamelite Yep, I did the same. It works for most scenarios. – explorer Nov 29 '18 at 07:31
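
A minimal sketch of the word-coordinate clustering idea from the last few comments, using pytesseract's image_to_data (the function, its name and the column_edges values are hypothetical; it assumes the input image is a crop of a single table row and that the vertical grid-line positions have already been recovered):

```python
import pytesseract
from pytesseract import Output

def words_to_cells(row_image, column_edges):
    """Group the OCR words of one table row into cells by x-coordinate.

    column_edges: sorted x-positions of the vertical grid lines,
    e.g. [0, 120, 260, 400] for a 3-column row (hypothetical values).
    """
    data = pytesseract.image_to_data(row_image, output_type=Output.DICT)
    cells = [[] for _ in range(len(column_edges) - 1)]
    for word, x, w in zip(data["text"], data["left"], data["width"]):
        if not word.strip():
            continue  # Tesseract emits empty entries for layout elements
        center = x + w / 2
        for col in range(len(column_edges) - 1):
            if column_edges[col] <= center < column_edges[col + 1]:
                cells[col].append(word)
                break
    return [" ".join(c) for c in cells]
```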