Obtaining data from a PDF file with the same layout as with a copy+paste

Question

I have a procedure I'm looking to automate which involves getting a series of tables from a PDF file. Currently I can do so by opening the file in any viewer (Adobe, Sumatra, Okular, etc.) and just pressing Ctrl+A, Ctrl+C, Ctrl+V to paste it into Notepad. This keeps each line aligned in a reasonable enough format that I can then run a regex and copy and paste the result into Excel for whatever is needed afterwards.

When trying to do this with Python I tried various modules, mainly PDFMiner, which sort of works when using this example, for instance, but it returns the data in a single column. Other options include getting the data as an HTML table, but in that case it adds extra splits mid-table, which makes the parsing more complicated, or occasionally even switches columns around between the first and second pages.

I've gotten a temporary solution working for now, but I'm worried I'm reinventing the wheel when I'm probably just missing a core option in the parser or that I need to consider some fundamental option of the way the PDF renderer works to solve this.

Any ideas on how to approach it?

Did you find a workaround/solution using the pdfminer Python library to maintain the layout of the output text the same as the PDF document? Looking at the source code, there is a [LAParams class](https://github.com/goulu/pdfminer/blob/master/pdfminer/layout.py#L32) which can control the layout params, but specifying the right values is a trial and error endeavor. Usage example: [extract_text_to_fp](https://github.com/goulu/pdfminer/blob/master/pdfminer/high_level.py#L21). I think I'm going to use `pdftotext -layout input.pdf output.txt` , see: http://askubuntu.com/q/52040 — Alex Bitek, Jan 01 '17 at 18:06
I did find it yes, but forgot to provide the answer due to the rush in implementation. I'll check the code and provide it in a few minutes. — Drexer, Jan 02 '17 at 18:26
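
A minimal sketch of the `pdftotext -layout` route mentioned in the comment above, followed by the "run a regex" step from the question. It assumes the Poppler `pdftotext` binary is on the PATH and that its output is UTF-8; the function names and the two-space gap rule are illustrative assumptions, not something taken from the original posts.

```python
import re
import subprocess


def build_pdftotext_cmd(pdf_path, txt_path):
    """Return the argv list for the `pdftotext -layout` call from the comment."""
    return ["pdftotext", "-layout", pdf_path, txt_path]


def split_columns(line, min_gap=2):
    """Split one layout-preserved line on runs of at least `min_gap` spaces.

    Single spaces inside a cell (e.g. "Item A") are kept.
    """
    pattern = r" {%d,}" % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]


def pdf_to_rows(pdf_path, txt_path):
    """Run pdftotext, then split every non-empty output line into cells."""
    # Assumption: pdftotext is installed; check=True raises if it fails.
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path), check=True)
    # Assumption: the text file is UTF-8 encoded.
    with open(txt_path, encoding="utf-8") as fh:
        return [split_columns(line) for line in fh if line.strip()]
```

The gap of two spaces is only a starting point; tables with narrow gutters or cells containing double spaces would need a different rule.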

score 1 · Accepted Answer · edited May 23 '17 at 12:32

I ended up implementing a solution based on this one, itself modified from code by tgray. It has worked consistently in all of the cases I've tested so far, but I have yet to work out how to set pdfminer's parameters directly to obtain the desired behaviour.

answered Jan 02 '17 at 19:32

Drexer

score 1 · Answer 2 · edited May 23 '17 at 12:25

Posting this just to get a piece of code out there that works with py35 for csv-like parsing. The column splitting is the simplest possible, but it worked for me.

Kudos to tgray in this answer for the starting point.

I also put in openpyxl since I preferred to have the results directly in Excel.

# works with py35 & pip-installed pdfminer.six in 2017
def pdf_to_csv(filename):
    from io import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams, LTChar
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

        def end_page(self, i):
            from collections import defaultdict
            lines = defaultdict(dict)  # y position -> {x position: character}
            for child in self.cur_item._objs:
                if isinstance(child, LTChar):
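                    # bbox is (x0, y0, x1, y1); use the right/top corner
                    (_, _, x, y) = child.bbox
                    # PDF y grows upwards, so negate it to sort top-down
                    line = lines[int(-y)]
                    line[x] = child.get_text()
                    # the line is now an unsorted dict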

            for y in sorted(lines.keys()):
                line = lines[y]
                # combine close letters to form columns
                xpos = tuple(sorted(line.keys()))
                new_line = []
                temp_text = ''
                for i in range(len(xpos)-1):
                    temp_text += line[xpos[i]]
                    if xpos[i+1] - xpos[i] > 8:
                        # the 8 is representing font-width
                        # needs adjustment for your specific pdf
                        new_line.append(temp_text)
                        temp_text = ''
                # adding the last column which also manually needs the last letter
                new_line.append(temp_text+line[xpos[-1]])

                self.outfp.write(";".join(new_line))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())

    password = ""
    maxpages = 0      # 0 means no page limit
    caching = True
    pagenos = set()   # an empty set means all pages

    interpreter = PDFPageInterpreter(rsrc, device)

    # the context manager closes the file even if parsing raises
    with open(filename, 'rb') as fp:
        for i, page in enumerate(PDFPage.get_pages(fp,
                                    pagenos, maxpages=maxpages,
                                    password=password, caching=caching,
                                    check_extractable=True)):
            outfp.write("START PAGE %d\n" % i)
            if page is not None:
                interpreter.process_page(page)
            outfp.write("END PAGE %d\n" % i)

    device.close()

    return outfp.getvalue()

fn = 'your_file.pdf'
result = pdf_to_csv(fn)

lines = result.split('\n')
import openpyxl as pxl
wb = pxl.Workbook()
ws = wb.active
for line in lines:
    ws.append(line.split(';'))
    # appending a list gives a complete row in xlsx
wb.save('your_file.xlsx')
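
To tune the gap threshold (the 8 above) without re-parsing the PDF each time, the column-splitting step can be tried in isolation. This is a minimal sketch under the same assumption as the code above, namely that one text line is a dict mapping x position to character; `group_by_gap` and `max_gap` are hypothetical names used only for illustration.

```python
def group_by_gap(chars, max_gap=8):
    """Group the characters of one line into cells.

    chars: dict mapping x position -> character (one text line).
    A new cell starts whenever the distance between two neighbouring
    x positions is larger than max_gap.
    """
    if not chars:
        return []
    xs = sorted(chars)
    cells = []
    current = chars[xs[0]]
    for prev_x, x in zip(xs, xs[1:]):
        if x - prev_x > max_gap:
            # gap wider than the threshold: close the current cell
            cells.append(current)
            current = ""
        current += chars[x]
    cells.append(current)
    return cells


# hypothetical x positions, chosen only to show the effect of the threshold
line = {10: "A", 15: "B", 40: "1", 45: "2", 80: "X"}
print(group_by_gap(line))              # ['AB', '12', 'X']
print(group_by_gap(line, max_gap=30))  # ['AB12', 'X']
```

A threshold that is too small splits words into single letters, while one that is too large merges neighbouring columns, so it is worth testing against a few real lines from the target PDF.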
