I'm converting PDF files to text with the PDFMiner Python library, using the code snippet provided in this SO answer. The problem is that the PDF is laid out in three columns, and I need to read each line. However, the text I get is out of order: sometimes it mixes the first and second columns, sometimes it mixes in the third one... Since the text does not follow any logical order, I can't parse each line. So, is there any way to get each individual line of the PDF file using PDFMiner?
EDIT:
PDFMiner comes with a command-line tool, pdf2txt.py, to convert PDF to text. Playing with it and setting the word margin to 0.05, I got better-formatted text, but still could not achieve the goal.
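
To make the goal concrete, here is a rough sketch of the kind of column-aware, line-by-line output I'm after. It is not a tested solution. It assumes pdfminer.six (the maintained fork, which may differ from the PDFMiner version I'm using), and it assumes the three columns are of equal width, which may not hold for every page. The helper names `iter_text_lines` and `order_lines_by_column` are made up for illustration.

```python
def order_lines_by_column(lines, page_width, n_columns=3):
    """Sort (x0, y1, text) tuples column by column, top to bottom.

    Assumes equal-width columns. PDF coordinates grow upward, so a
    larger y1 means the line sits higher on the page.
    """
    col_width = page_width / n_columns

    def sort_key(item):
        x0, y1, _text = item
        # Clamp the column index into the range [0, n_columns - 1].
        col = max(0, min(int(x0 // col_width), n_columns - 1))
        return (col, -y1)

    return [text for _x0, _y1, text in sorted(lines, key=sort_key)]


def iter_text_lines(pdf_path, word_margin=0.05):
    """Yield (page_width, [(x0, y1, text), ...]) for each page.

    Assumption: pdfminer.six is installed. It is imported lazily so the
    sorting helper above can be used without it.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer, LTTextLine

    laparams = LAParams(word_margin=word_margin)
    for page in extract_pages(pdf_path, laparams=laparams):
        lines = []
        for element in page:
            if isinstance(element, LTTextContainer):
                for line in element:
                    if isinstance(line, LTTextLine):
                        lines.append((line.x0, line.y1, line.get_text().strip()))
        yield page.width, lines
```

With something like this, each page's lines could be passed through `order_lines_by_column(lines, page_width)` to read the first column fully before the second and third. Whether the layout analysis splits the lines correctly in the first place is exactly what I'm unsure about.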