Python read pdf in sections

Question

I'm trying to read a pdf file where each page is divided into 3x3 blocks of information of the form

A | B | C
D | E | F
G | H | I

Each entry spans multiple lines. A simplified example of one entry is this card, and the other 8 slots would hold similar cards. I'd like to read A, then B, then C, and so on; however, I could live with reading the first line of A, B, and C, then the second line of A, B, and C, etc. I've looked at pdfminer and pypdf, but I haven't found anything that fits. The answer here works fairly well, but the order of columns routinely gets distorted.

See my answer [here](http://stackoverflow.com/a/30676480/754254) for another attempt — Felipe, Jun 05 '15 at 22:07

score 1 · Accepted Answer · edited May 23 '17 at 11:51

In the second answer here, replace

self.rows = sorted(self.rows, key = lambda x: (x[0], -x[2]))

by

self.rows = sorted(self.rows, key = lambda x: (x[0], -x[2], x[1]))

Very important: See the last paragraph of this answer.
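
To see what the extra key component changes, here is a minimal sketch. It assumes each row is a tuple of (page, x, y, text) with y growing up the page; that layout is a guess at how x[0], x[1] and x[2] are used in the linked answer, and the sample rows are made up for illustration.

```python
# Hypothetical rows: (page, x, y, text). The tuple layout is an assumption
# made for illustration; y grows upward, as in PDF coordinates.
rows = [
    (1, 400, 700, "C line 1"),
    (1, 0, 700, "A line 1"),
    (1, 200, 700, "B line 1"),
    (1, 200, 680, "B line 2"),
    (1, 0, 680, "A line 2"),
    (1, 400, 680, "C line 2"),
]

# Original key: page, then top to bottom. Rows that share the same y keep
# whatever order they were extracted in, because sorted() is stable.
by_line = sorted(rows, key=lambda x: (x[0], -x[2]))

# Amended key: adding x[1] breaks ties from left to right.
by_line_then_column = sorted(rows, key=lambda x: (x[0], -x[2], x[1]))

old_order = [r[3] for r in by_line]
new_order = [r[3] for r in by_line_then_column]
print(old_order)
print(new_order)
```

With the amended key the output reads the first line of A, B and C, then the second line of each, which is the fallback order the question said would be acceptable.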

answered Apr 25 '15 at 12:00

Manuel Antón


score 0 · Answer 2 · answered Apr 21 '15 at 17:36

I wasn't able to come up with a perfect solution, but the following works best for what I need.

import PyPDF2

def getPDFContent(path, pages=None):
    # Written for Python 3 and PyPDF2 2.x or later; the original used the
    # older PdfFileReader / getPage / extractText names.
    content = ""
    with open(path, "rb") as p:
        pdf = PyPDF2.PdfReader(p)
        # With no page list given, read every page in order.
        if not pages:
            pages = range(len(pdf.pages))
        for i in pages:
            content += pdf.pages[i].extract_text() + "\n"
    # Turn non-breaking spaces into plain spaces, then collapse runs of whitespace.
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content

Pistol Pete


Can you please tell what second last line is doing? – Ashish Pani Jan 18 '18 at 10:30
@AshishPani it's been about 3 years since I've looked at this, but I think that I was getting the byte ``\xa0`` where I wanted spaces, but then I was also getting extra white space, so that's why I'd take the content, and replace the byte with spaces, strip out extra white space and join things together. – Pistol Pete Jan 19 '18 at 04:13
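
The cleanup described in the comment above can be tried on its own. A minimal sketch, using a made-up sample string rather than real extracted text:

```python
# Made-up sample containing non-breaking spaces (\xa0) and ragged whitespace.
raw = "  Name:\xa0Alice \n\n  Level:\xa0\xa03  \n"

# Replace non-breaking spaces with plain ones, trim both ends, then split
# on any run of whitespace and rejoin with single spaces.
cleaned = " ".join(raw.replace("\xa0", " ").strip().split())
print(cleaned)
```

Note that this also flattens line breaks, so the result is one long line of text.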
