I have a large number of PDFs with different structures, and I need to extract the text from them and find some key indicators.
I am using the pyPdf module, and when it does not retrieve any text from a PDF, I fall back to PDF Miner.
The problem is that for some of the files neither module works: no text at all is extracted from the PDF. Some of these files are scanned or image-only PDFs, but others appear to have the same structure as the ones that can be parsed.
Here are the two functions I use; maybe I am missing something:
Using pyPdf
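    import pyPdf

    def getPDFContent(path):
        content = ""
        # Open in binary mode; pyPdf reads the raw PDF bytes.
        pdf = pyPdf.PdfFileReader(open(path, "rb"))
        # Concatenate the text of every page.
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + " "
        # Replace non-breaking spaces and collapse runs of whitespace.
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content

    mt = getPDFContent(filename).encode("ascii", "xmlcharrefreplace")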
Using PDF Miner
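    from StringIO import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
    from pdfminer.pdfpage import PDFPage

    def getPDFContent(path):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        fp = open(path, 'rb')
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0  # 0 means no page limit
        caching = True
        pagenos = set()  # an empty set means all pages
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                      caching=caching, check_extractable=True):
            interpreter.process_page(page)
            # Mark the page boundary in the output.
            retstr.write("nextpage")
        text = retstr.getvalue()
        fp.close()
        device.close()
        retstr.close()
        return text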
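
For reference, the fallback described above (try one extractor, move on to the next when nothing comes back) can be sketched roughly as follows. This is only a minimal sketch under stated assumptions, not the exact wiring: `has_text` and `extract_with_fallback` are hypothetical helper names, and `extractors` is assumed to be a list of `(name, function)` pairs such as the two functions above under distinct names. One detail worth noting is that the PDF Miner function writes a "nextpage" marker for every page, so its output is never an empty string even when no text was found; the check below ignores that marker.

```python
def has_text(text, page_marker="nextpage"):
    """Return True if text holds anything besides whitespace and page markers."""
    return bool(text.replace(page_marker, "").strip())


def extract_with_fallback(path, extractors):
    """Try each (name, extractor) pair in order.

    Returns (name, text) for the first extractor that yields real text,
    or (None, "") if every extractor fails or yields nothing.
    """
    for name, extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # A parser that raises on this file is treated as "no text".
            continue
        if text and has_text(text):
            return name, text
    return None, ""
```

A result of `(None, "")` would then flag the files that neither module can read, which are the candidates to inspect by hand (for example, to see whether they are image-only).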