
I found some code for PDF data extraction, posted by a user here on Stack Overflow. Looking at the output, however, it extracts the text column by column. Is there a way to get pdfminer.six to read the data row by row?

This is the code I used (slightly modified from the original, with the comments removed for readability). Here is also a screenshot of the current output for an example PDF.

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams
from pdfminer.converter import PDFPageAggregator
import pdfminer


fp = open('test.pdf', 'rb')

parser = PDFParser(fp)

document = PDFDocument(parser)

if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

rsrcmgr = PDFResourceManager()

laparams = LAParams()

device = PDFPageAggregator(rsrcmgr, laparams=laparams)

interpreter = PDFPageInterpreter(rsrcmgr, device)

def parse_obj(lt_objs):
    for obj in lt_objs:
        if isinstance(obj, pdfminer.layout.LTTextBoxHorizontal):
            print(obj.get_text().replace("\n", ""))
        elif isinstance(obj, pdfminer.layout.LTFigure):
            parse_obj(obj)

for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    layout = device.get_result()

    parse_obj(layout)
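
One possible direction, as a rough sketch rather than a confirmed solution: since every layout object in pdfminer.six carries bounding-box coordinates (`x0`, `y0`, `x1`, `y1`), the text fragments could be collected first and then regrouped by vertical position. The helper `group_into_rows`, the `Fragment` tuple layout, and the `y_tolerance` value below are hypothetical names and values chosen for illustration; they are not part of pdfminer.six, and the sample coordinates are made up.

```python
from typing import Iterable, List, Tuple

# Hypothetical fragment layout: (x0, y, text), where x0 is the left edge
# and y is a vertical coordinate such as the top edge (y1) of a text line.
Fragment = Tuple[float, float, str]


def group_into_rows(fragments: Iterable[Fragment],
                    y_tolerance: float = 2.0) -> List[List[str]]:
    """Group text fragments into rows, top to bottom, left to right.

    Fragments whose y values differ by at most y_tolerance from the first
    fragment of a row are treated as belonging to that row. The tolerance
    is an assumption and likely needs tuning per document.
    """
    rows = []  # each entry: (reference_y, [(x0, text), ...])
    # PDF coordinates have their origin at the bottom-left corner,
    # so a larger y means higher on the page: sort descending.
    for x0, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append((y, [(x0, text)]))
    # Within each row, order the cells by their left edge.
    return [[text for _, text in sorted(cells, key=lambda c: c[0])]
            for _, cells in rows]


# Made-up sample: two columns that were extracted column by column.
sample = [
    (50, 700.0, "Name"), (50, 680.0, "Alice"), (50, 660.0, "Bob"),
    (200, 700.5, "Age"), (200, 680.2, "30"), (200, 659.8, "25"),
]
for row in group_into_rows(sample):
    print(row)
```

To feed this from the code above, `parse_obj` could iterate over the individual lines inside each `LTTextBoxHorizontal` and collect `(line.x0, line.y1, line.get_text().strip())` instead of printing the whole box. Whether a fixed tolerance is enough depends on how consistently the rows are aligned in the PDF.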

Thanks in advance.

riffel
