I am trying to parse some PDF documents (1.7 format) to extract numeric data.
I am using the Python PDFMiner package, and it works well.
For most of the document, each row in a table is converted to a single text string. But sometimes two adjacent rows have their content interleaved into a single string. For example:
The visual appearance within the PDF is as follows:
LZX DEC-18 13.95 .00 0 0 0 0 0 0 0
Totals for LZX: 0 3,481 0 0 0 0 0
But the extracted text looks like this, appearing in column rather than row order:
---
LZX
Totals for LZX:
DEC-18
13.95
.00
0
0
0
3,481
0
0
0
0
0
0
0
0
0
0
I cannot see any options within the PDFMiner script that would alter this, so I'm assuming it has something to do with the way the PDF document was originally created?
It makes parsing quite difficult, so it would be handy to know when this might occur.
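One workaround I am considering is to ignore the extracted text order and regroup the fragments by their vertical position on the page. Below is a minimal sketch of that idea, not something PDFMiner provides: `group_into_rows` and `y_tolerance` are names I made up, and the coordinates in the example are hypothetical. In practice the x and y values would presumably come from the `bbox` attribute of PDFMiner's layout objects.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Rebuild table rows from (x, y, text) fragments.

    A fragment joins the current row if its y value is within
    y_tolerance of that row's first fragment; each row is then
    ordered left to right by x.
    """
    rows = []  # list of (row_y, [(x, text), ...])
    # PDF y coordinates grow upwards, so sort descending to read top-down.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [
        " ".join(text for _, text in sorted(cells, key=lambda c: c[0]))
        for _, cells in rows
    ]


# Hypothetical coordinates, listed in the interleaved order shown above.
fragments = [
    (10, 700, "LZX"), (10, 688, "Totals for LZX:"),
    (60, 700, "DEC-18"), (120, 700, "13.95"), (170, 700, ".00"),
    (220, 700, "0"), (220, 688, "0"),
    (270, 700, "0"), (270, 688, "3,481"),
]
rows = group_into_rows(fragments)
```

With these made-up coordinates the two rows come back separately, one string per visual row. The tolerance would need tuning per document, since text on the same visual row does not always share exactly the same baseline.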