2

I'm trying to read a PDF file where each page is divided into a 3x3 grid of information blocks of the form

A | B | C
D | E | F
G | H | I

Each entry spans multiple lines. A simplified example of one entry is this card, with similar cards in the other 8 slots. I'd like to read A, then B, then C, and so on; however, I could get by with reading the first line of A, B, and C, then the second line of A, B, and C, etc. I've looked at pdfminer and pypdf, but I haven't found anything that does what I'm looking for. The answer here works fairly well, but the order of columns routinely gets distorted.

Community
Pistol Pete

2 Answers

1

In the second answer here replace

self.rows = sorted(self.rows, key = lambda x: (x[0], -x[2]))

by

self.rows = sorted(self.rows, key = lambda x: (x[0], -x[2], x[1]))

Very important: See the last paragraph of this answer.
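To see why the extra sort key helps, here is a minimal sketch. It assumes each row is a tuple of (page number, x position, y position, text), which is a guess at the layout used in the linked answer; the sample coordinates and texts are made up for illustration. With only `(x[0], -x[2])` as the key, rows that share the same page and height keep whatever order the extractor produced them in. Adding `x[1]` breaks those ties from left to right.

```python
# Hypothetical rows: (page, x, y, text). The tuple layout is an assumption.
rows = [
    (0, 400.0, 700.0, "C line 1"),
    (0, 50.0, 700.0, "A line 1"),
    (0, 225.0, 700.0, "B line 1"),
    (0, 225.0, 688.0, "B line 2"),
    (0, 50.0, 688.0, "A line 2"),
    (0, 400.0, 688.0, "C line 2"),
]

# Original key: page, then top-to-bottom (larger y first). Ties on the same
# line keep their input order because sorted() is stable.
old_order = [r[3] for r in sorted(rows, key=lambda x: (x[0], -x[2]))]

# Modified key: same as above, with x as a left-to-right tie-breaker.
new_order = [r[3] for r in sorted(rows, key=lambda x: (x[0], -x[2], x[1]))]

print(old_order)
print(new_order)
```

This gives the "first line of A, B, and C, then the second line" ordering from the question, not a full cell-by-cell read.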

Community
0

I wasn't able to come up with a perfect solution, but the following works best for what I need.

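import PyPDF2

def getPDFContent(path, pages=None):
    """Return the text of the given pages (all pages if none are given) as one string."""
    content = ""
    # Open in binary mode; the with-block closes the file when done.
    with open(path, "rb") as p:
        # PdfReader, .pages and extract_text are the PyPDF2 >= 2.0 names for
        # PdfFileReader, getPage/getNumPages and extractText.
        pdf = PyPDF2.PdfReader(p)
        if not pages:
            pages = range(len(pdf.pages))
        for i in pages:
            content += pdf.pages[i].extract_text() + "\n"
    # Turn non-breaking spaces into plain spaces, then collapse whitespace runs.
    content = " ".join(content.replace("\xa0", " ").strip().split())
    return content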
Pistol Pete
  • Can you please explain what the second-to-last line is doing? – Ashish Pani Jan 18 '18 at 10:30
  • @AshishPani it's been about 3 years since I looked at this, but I think I was getting the character ``\xa0`` (a non-breaking space) where I wanted spaces, and I was also getting extra whitespace. So that line takes the content, replaces ``\xa0`` with spaces, strips the extra whitespace, and joins everything back together. – Pistol Pete Jan 19 '18 at 04:13
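
The cleanup described in the comment above can be tried on its own. This is a small sketch using a made-up input string (`raw` and its contents are purely illustrative, not output from a real PDF):

```python
# Made-up sample: a non-breaking space, a double space, and stray newlines.
raw = "Name:\xa0Ace  of\n Spades \n\n"

# Same expression as the second-to-last line of getPDFContent:
# replace non-breaking spaces, trim the ends, split on any whitespace run,
# and re-join the pieces with single spaces.
clean = " ".join(raw.replace("\xa0", " ").strip().split())

print(repr(clean))
```

In Python 3, `str.split()` with no arguments already treats ``\xa0`` as whitespace, so the `replace` call is mostly a safeguard there; it mattered more with Python 2 byte strings.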