I need to parse tabular data from a PDF. I am using the "tika" library in Python, which works well except for one problem:
The PDF contains text in a table, and one row is split by a page break: the first part of the row is on one page and the rest continues on the second page. This spreads the key and value of that row across two pages, and I think tika is treating the single row as two separate rows.
As a result, the output places the value in the middle of the key text, which is wrong.
For example:
str = "This is the long key data xxxxxxx value xxxxxxxxx remaining key data"
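One way to check where the page break falls in the extracted text is to request XHTML output instead of plain text and split it per page. This is only a diagnostic sketch, not a fix for the split row. It assumes that tika's `parser.from_file` is called with `xmlContent=True` and that the PDF parser wraps each page in a `<div class="page">` element; `PageSplitter`, `split_pages`, and `extract_pages` are hypothetical helper names introduced here for illustration.

```python
from html.parser import HTMLParser


class PageSplitter(HTMLParser):
    """Collect the text found inside each <div class="page"> element."""

    def __init__(self):
        super().__init__()
        self.pages = []   # one list of text chunks per page
        self._depth = 0   # div nesting depth inside the current page (0 = outside)

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            # A nested div inside a page: track it so its closing tag
            # is not mistaken for the end of the page.
            self._depth += 1
        elif ("class", "page") in attrs:
            self.pages.append([])
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.pages[-1].append(data)


def split_pages(xhtml):
    """Return a list with the whitespace-normalized text of each page."""
    splitter = PageSplitter()
    splitter.feed(xhtml)
    splitter.close()
    return [" ".join(" ".join(chunks).split()) for chunks in splitter.pages]


def extract_pages(pdf_path):
    """Run tika on a PDF and return its text page by page."""
    # Imported here so split_pages can be tried without tika installed.
    from tika import parser

    # Assumption: xmlContent=True makes tika return XHTML with page divs.
    parsed = parser.from_file(pdf_path, xmlContent=True)
    return split_pages(parsed["content"] or "")
```

Calling `extract_pages("file.pdf")` (the path is a placeholder) should then show on which page each fragment of the broken row ends up, which at least makes it possible to tell which text belongs before and after the page break.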
Any suggestions?