My task is to parse tabular data from a PDF. I am using the "tika" library in Python, which works well except for one problem, described below.

The PDF contains text in a table, and one row is split across a page break: the first half of the row is on one page and the rest is on the second page. This separates the key and value data of that row across two pages, and I think tika is treating the single row as two separate rows.

In the output, the value ends up inserted in the middle of the key text, which is not right.

For example:

text = "This is the long key data xxxxxxx value xxxxxxxxx remaining key data"

Any suggestions?

1 Answer

You can experiment with Tesseract's page segmentation mode (psm); see "Pytesseract OCR multiple config options".

To set a different psm in tika (the default value is 1), you can either send the header X-Tika-OCRPageSegMode: xx with the request, or set it through the Tesseract config: https://tika.apache.org/1.24/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html#setPageSegMode-java.lang.String-
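
As a rough sketch of the header approach from Python: the tika package's parser.from_file accepts a headers dictionary that is forwarded to the Tika server. The helper names build_ocr_headers and parse_with_psm and the file name "table.pdf" are made up for illustration, and the 0 to 13 range assumes Tesseract 4 or newer. The setting only matters when Tika actually runs OCR on the document, so it may have no effect on a PDF that already has a text layer.

```python
def build_ocr_headers(psm: int) -> dict:
    """Build the request header that selects Tesseract's page segmentation mode."""
    # Assumption: Tesseract 4+ accepts psm values 0 through 13.
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return {"X-Tika-OCRPageSegMode": str(psm)}


def parse_with_psm(pdf_path: str, psm: int) -> str:
    """Parse a PDF through Tika with the given psm and return the extracted text."""
    # Imported here so the header helper can be used without tika installed.
    # Requires the tika package and a reachable Tika server with Tesseract.
    from tika import parser

    parsed = parser.from_file(pdf_path, headers=build_ocr_headers(psm))
    # "content" can be None when nothing was extracted.
    return parsed.get("content") or ""


if __name__ == "__main__":
    # Hypothetical file name; try a few modes and compare how the split row comes out.
    for psm in (1, 4, 6):
        text = parse_with_psm("table.pdf", psm)
        print(f"--- psm={psm} ---")
        print(text[:500])
```

Comparing the output of a few modes side by side is usually the quickest way to see whether any of them keeps the key and the value of the split row together.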

marek.kapowicki