3

I'm converting PDF files to text with the PDFMiner Python library, using the code snippet provided in this SO answer. The problem is that the PDF is formatted in three columns, and I need to read each line. However, the text I get is out of order: sometimes it mixes the first and second columns, sometimes it mixes in the third... Because the text does not follow any logical order, I can't parse each line. Is there any way to get each individual line of the PDF file using PDFMiner?

EDIT:

PDFMiner comes with a command-line tool, pdf2txt.py, to convert PDF to text. Playing with it and setting the word margin to 0.05, I got better-formatted text, but still could not achieve the goal.

Martin Thoma
davids

2 Answers

0

I had a similar problem when parsing tables*. What worked for me was to extract HTML instead of plain text. Then you can parse the HTML and take the table tags into account (see the Python documentation for HTMLParser). I only had tables to find, though.
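A minimal sketch of that idea, using only the standard library's html.parser. It assumes the exported HTML really contains table, tr, td and th tags (as it did in my case); the names TableParser and parse_table_rows are made up for illustration.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> row into a list of lists."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed


def parse_table_rows(html):
    """Return all table rows found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Calling parse_table_rows on the exported HTML gives one list per row, so each row can be handled on its own without caring how the text was laid out visually.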

My two cents :)

*Tables copied from Word into a Qt TextEdit widget. The widget accepts rich text, but the tables would get mucked up if exported as plain text. Exported as HTML, parsed the HTML, got the data :) Did this at work, don't have the code here.

Petter TB
0

While working on a similar problem, I stumbled upon a partial solution. You can set the LAParams passed to extract_text as follows:

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

laparams = LAParams(boxes_flow=None)

and then pass it through where extract_text is used:

text = extract_text(filename, laparams=laparams)

This way, I get text that is far more representative of the horizontal and vertical layout of the actual PDF page.
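If you need the individual lines rather than one big string, something along these lines might work. It is only a sketch: it assumes pdfminer.six (for extract_pages) and columns of equal width, and the helpers column_order and read_lines are hypothetical names, not part of the library.

```python
def column_order(lines, page_width, n_columns=3):
    """Sort (x0, y1, text) tuples column by column, top to bottom.

    Assumes equal-width columns, which is a guess about the layout,
    not something pdfminer reports.
    """
    column_width = page_width / n_columns

    def sort_key(line):
        x0, y1, _ = line
        # Clamp so lines slightly outside the page still get a valid column.
        column = max(0, min(int(x0 // column_width), n_columns - 1))
        # PDF y coordinates grow upwards, so negate y1 to read downwards.
        return (column, -y1)

    return [text for _, _, text in sorted(lines, key=sort_key)]


def read_lines(path, n_columns=3):
    """Return the text lines of a PDF in column order, page by page."""
    # Imported here so column_order() stays usable without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer, LTTextLine

    result = []
    for page in extract_pages(path, laparams=LAParams(boxes_flow=None)):
        lines = [
            (line.x0, line.y1, line.get_text().strip())
            for box in page
            if isinstance(box, LTTextContainer)
            for line in box
            if isinstance(line, LTTextLine)
        ]
        result.extend(column_order(lines, page.width, n_columns))
    return result
```

Sorting by the left edge of each line first and by its top edge second is what keeps the columns from being interleaved. If the columns are not equally wide, the column boundaries would have to be supplied some other way.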

Silsen