PDFMiner - Get text lines

Question

I'm converting PDF files to text with the PDFMiner Python library, using the code snippet provided in this SO answer. The problem is that the PDF is laid out in three columns, and I need to read each line. However, the text I get is out of order: sometimes it mixes the first and second columns, sometimes it mixes in the third... Because the text does not follow any logical order, I can't parse each line. Is there any way to get each individual line of the PDF file using PDFMiner?

EDIT:

PDFMiner comes with a command-line tool, pdf2txt.py, to convert PDF to text. Playing with it and setting the word margin to 0.05, I got better-formatted text, but still could not achieve the goal.

score 0 · Answer 1 · answered Aug 06 '13 at 08:22


I had a similar problem when parsing tables*. What worked for me was to extract HTML. Then you can parse the HTML table and take the table tags into account (see the Python documentation for HTMLParser). I only had tables to find, though.

My two cents :)

*Tables from word copied into QT TextEdit widget. Widget accepts rich text, but the tables would be mucked up if exported as text. Exported as HTML, parsed HTML, got data :) Did this at work, don't have the code here.
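The HTML-parsing idea above can be sketched with the standard library's html.parser.HTMLParser. This is only an illustration under assumptions: `TableParser` and `parse_table` are hypothetical names, and it presumes the exported HTML really contains table tags, as it did in the Qt case. Whether PDFMiner's HTML output contains them for a given PDF is not something this answer establishes.

```python
from html.parser import HTMLParser
from typing import List


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row starts: open an empty list of cells.
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            # A new cell starts inside the current row.
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_table(html: str) -> List[List[str]]:
    """Return the table cells as a list of rows (hypothetical helper)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return [[cell.strip() for cell in row] for row in parser.rows]
```

Feeding it the exported HTML string gives one list per table row, which can then be processed line by line.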


Petter TB


Can you please add a link to the documentation for HTMLParser? Thanks! – yishairasowsky Feb 19 '20 at 09:27
Do you not mean pdfminer.converter.HTMLConverter? The link is https://programtalk.com/python-examples/pdfminer.converter.HTMLConverter/ – yishairasowsky Feb 19 '20 at 09:31

score 0 · Answer 2 · answered Jan 20 '23 at 00:02


While working on a similar problem, I stumbled on a partial solution. You can set the LAParams passed to extract_text as follows:

from pdfminer.high_level import extract_text
from pdfminer.layout import LAParams

laparams = LAParams(boxes_flow=None)

and then pass it through where extract_text is used:

text = extract_text(filename, laparams=laparams)

This way, I am getting text that's way more representative of the horizontal and vertical layout of the actual PDF Page.
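If individual lines are the goal, one possible sketch is to walk the layout objects rather than take the flat string. It assumes pdfminer.six is installed and that the columns are of equal width, which the question does not state. `page_lines` and `order_by_column` are hypothetical helper names, not part of PDFMiner, and "paper.pdf" is a placeholder path.

```python
from typing import Iterator, List, Tuple

# (x0, y1, text): left edge, top edge, and content of one text line.
Line = Tuple[float, float, str]


def page_lines(path: str) -> Iterator[Tuple[float, List[Line]]]:
    """Yield (page_width, lines) for each page of the PDF at `path`."""
    # Imported lazily so order_by_column works without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextBox, LTTextLine

    for page in extract_pages(path, laparams=LAParams(boxes_flow=None)):
        lines: List[Line] = []
        for element in page:
            if isinstance(element, LTTextBox):
                for line in element:
                    if isinstance(line, LTTextLine):
                        text = line.get_text().strip()
                        if text:
                            lines.append((line.x0, line.y1, text))
        yield page.width, lines


def order_by_column(lines: List[Line], page_width: float,
                    n_columns: int = 3) -> List[str]:
    """Return line texts column by column, top to bottom in each column.

    Assumption: the page is split into n_columns equal-width columns.
    """
    col_width = page_width / n_columns

    def key(line: Line) -> Tuple[int, float]:
        x0, y1, _ = line
        column = min(max(int(x0 // col_width), 0), n_columns - 1)
        # PDF y coordinates grow upward, so negate to sort top first.
        return (column, -y1)

    return [text for _, _, text in sorted(lines, key=key)]


# Usage (needs a real file; "paper.pdf" is a placeholder):
# for width, lines in page_lines("paper.pdf"):
#     print("\n".join(order_by_column(lines, width)))
```

Bucketing by the left edge of each line is a deliberate simplification. Headings or figures that span several columns would need separate handling.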


Silsen


What is extract_text here? – m9m9m Mar 14 '23 at 11:20
@m9m9m It's a function from pdfminer. You can import it with `from pdfminer.high_level import extract_text`, assuming you have installed pdfminer. – Silsen Mar 15 '23 at 15:49
