How to extract tabular data from a PDF properly when a row's data is split across two pages?

Question

My task is to parse tabular data from a PDF. I am using the "tika" library in Python, which works well except for one problem:

The PDF has text in a tabular format, and one row starts on the first page and ends on the second. This splits the key and value data of that row across two pages, and I think tika is treating this single row as two separate rows.

In the output, the value is inserted in the middle of the key, which is not right.

For example:

str = "This is the long key data xxxxxxx value xxxxxxxxx remaining key data"

Any suggestions?

score 0 · Answer 1 · answered Jan 08 '21 at 10:04

You can experiment with Tesseract's page segmentation mode (psm); see "Pytesseract OCR multiple config options".

To set a different psm in Tika (1 is the default value), you can either use the header X-Tika-OCRPageSegMode: xx or use the Tesseract config: https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setPageSegMode-java.lang.String-
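
As a rough sketch of the header approach from Python, assuming the tika-python client and a Tika server with Tesseract available: the helper names build_ocr_headers and extract_text, and the choice of psm 6 as a starting value, are illustrative assumptions and not part of the original answer. Note that this header should only influence text that Tika obtains through OCR, so it may have no effect on a PDF that already contains a text layer.

```python
def build_ocr_headers(psm):
    """Build request headers that ask Tika's Tesseract parser for a given psm.

    Hypothetical helper. Tesseract page segmentation modes run from 0 to 13.
    """
    psm = int(psm)
    if not 0 <= psm <= 13:
        raise ValueError("psm must be between 0 and 13, got %d" % psm)
    return {"X-Tika-OCRPageSegMode": str(psm)}


def extract_text(pdf_path, psm=6):
    """Parse a PDF through Tika with the requested psm and return its text.

    Hypothetical helper. psm=6 is only an example value to experiment with.
    """
    # Imported here so build_ocr_headers can be used without tika installed.
    from tika import parser

    parsed = parser.from_file(pdf_path, headers=build_ocr_headers(psm))
    # "content" can be None when Tika extracts no text at all.
    return parsed.get("content") or ""
```

Calling extract_text with a few different psm values on the same file and comparing the text around the page break is one way to see whether any mode keeps the key and the value of the split row together.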