I am trying to parse some PDF documents (1.7 format) to extract numeric data.

I am using the Python PDFMiner package and it works well.

For most of the document, a row in a table is converted to a single text string. But sometimes two adjacent rows have their content interleaved into a single string. For example:

The visual appearance within the PDF is as follows:

LZX DEC-18         13.95 .00 0     0 0 0 0 0 0
Totals for LZX:              0 3,481 0 0 0 0 0

But the extracted text looks like this, appearing in column order rather than row order:

---
LZX
Totals for LZX:

DEC-18

13.95

.00

0
0

0
3,481

0
0

0
0

0
0

0
0

0
0

I cannot see any options within the PDFMiner script that would alter this. So I'm assuming it has something to do with the way the PDF document was originally created?

It makes parsing quite difficult, so it would be handy to know when this might occur.

Kim Ryan
  • What's the question? – Burhan Khalid Oct 02 '14 at 12:32
  • Welcome to Stack Overflow. This is not a good way to ask a question here. Did you try anything so far to solve your problem? Show your effort first so people might show theirs. Please read [FAQ](http://stackoverflow.com/tour), [How to Ask](http://stackoverflow.com/help/how-to-ask) and [help center](http://stackoverflow.com/help) as a start. – Nahuel Ianni Oct 02 '14 at 12:37
  • My question is in the last 2 paragraphs. What determines the order that different table cells appear in the converted text stream? Most of the time it occurs left to right and then down the page, in a sequence you would expect. But not always as per the example. I want to know why this occurs. – Kim Ryan Oct 02 '14 at 12:38
  • Hi Nahuel, I have given a detailed description of the problem. My solution is a workaround to scan for text not appearing in row order. But this is messy, so I would like to know what is invariant about the PDF text order. I am seeking some input on how PDF order is determined. Not sure how this can be interpreted as 'not trying to solve a problem'. – Kim Ryan Oct 02 '14 at 12:46
  • Ok, no actual advice arriving from subject matter experts so I did my own searching. Looking at http://stackoverflow.com/questions/1848464/advanced-pdf-parsing-using-python-extracting-text-without-tables-etc-whats/1851011#1851011, the relevant section is "in a PDF, the text is not continuous, but made from a lot of small groups of characters positioned absolutely in the page. The focus of PDF is to keep the layout intact. It's not content oriented but presentation oriented". So it seems the only solution would be at the PDF creation stage using a tool such as Acrobat. – Kim Ryan Oct 06 '14 at 00:51
  • Could the above be considered an answer? I am new here, but I have been blocked from adding answers after only submitting 2. I was trying to learn the protocols here, but was barred so quickly that there was no chance for me to edit or improve my answers. I can't even raise this on meta to have it reviewed. I did flag a problem with my blocked answer, but even this was ignored. – Kim Ryan Oct 06 '14 at 00:55

1 Answer

My initial assumption about PDF rendering was that it would be similar to the raster output that a printer performs. That is, text would be created from left to right within a line, and then step down a line.

But I now realise that this is incorrect. The rendering pattern set by the PDF producer is more like what an X-Y plotter could produce, with an emphasis on object proximity over scan direction.

My conclusion is that PDF scanning is inherently difficult, as no assumptions can be made about text ordering within a page. The solution, where possible, is to go back to the source document that the PDF was generated from. If it is tabular in structure, it is likely to be easy to retrieve all the data from this format.
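
When the source document is not available, one possible workaround is to ignore the order in which the text arrives and rebuild rows from page coordinates instead. The following is only a minimal sketch under some assumptions: each text fragment is available as an `(x, y, text)` tuple (PDFMiner's layout objects expose a `bbox` attribute that could supply these), the y axis grows upward as in PDF page space, and fragments on the same visual row share roughly the same y value. The function name, the `y_tolerance` parameter, and the sample coordinates are all hypothetical.

```python
def rows_from_fragments(fragments, y_tolerance=2.0):
    """Rebuild visual rows from (x, y, text) fragments given in any order.

    Assumes PDF-style coordinates, where a larger y is higher on the page.
    """
    # Visit fragments from the top of the page to the bottom.
    ordered = sorted(fragments, key=lambda frag: -frag[1])

    rows = []  # each entry is [anchor_y, [(x, text), ...]]
    for x, y, text in ordered:
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            # Close enough to the current row's baseline: same row.
            rows[-1][1].append((x, text))
        else:
            # Start a new row anchored at this fragment's y position.
            rows.append([y, [(x, text)]])

    # Within each row, read the cells from left to right.
    return [
        " ".join(text for _, text in sorted(cells, key=lambda cell: cell[0]))
        for _, cells in rows
    ]


# Fragments in column order, as in the question (coordinates are made up).
fragments = [
    (50, 700, "LZX"), (50, 688, "Totals for LZX:"),
    (90, 700, "DEC-18"),
    (200, 700, "13.95"),
    (250, 700, ".00"),
    (300, 700, "0"), (300, 688, "0"),
    (340, 700, "0"), (340, 688, "3,481"),
]

for row in rows_from_fragments(fragments):
    print(row)
```

This only helps when rows are cleanly separated vertically. Cells that wrap onto several lines, or rows whose baselines differ by more than the tolerance, would need extra handling.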

Kim Ryan