21

I have followed a few tutorials, but I am not able to get this code block to run. I made what I believe are the necessary switches from StringIO to BytesIO.

I am unsure why 'banana' is printing nothing. I think the errors might be red herrings? Is it something to do with me following a Python 2.7 tutorial and trying to translate it to Python 3?

Errors:

  File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 28, in <module>
    banana = convert("A1.pdf")
  File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 19, in convert
    infile = file(fname, 'rb')
NameError: name 'file' is not defined

Script:

from io import BytesIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = BytesIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text

banana = convert("A1.pdf")
print(banana)

The same thing happens with this variant:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

Banana = convert_pdf_to_txt("A1.pdf")
print(Banana)

I have tried searching for this (most of the pdfminer code is from this or this), but I am having no luck.

Any insight is appreciated.

Cheers

Community
gary
  • Check out below answer which works in May 2020 and is very simple: from pdfminer.high_level import extract_text then text = extract_text('report.pdf') stackoverflow.com/a/61857301/7483211 – Cornelius Roemer May 19 '20 at 13:40
  • `file()` is replaced by `open()` in Python 3. See my answer below: https://stackoverflow.com/a/69962200/4054971 – Pieter Nov 14 '21 at 10:38

5 Answers

36

There is a solution for Python 3.5: you need pdfminer.six. Under Windows 10 I could easily install it with

pip install pdfminer.six

You can check the installed version with

pdfminer.__version__
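
The same check can be done from a script. This is only a sketch: it assumes Python 3.8+ (for `importlib.metadata`) and that the package was installed under the PyPI name `pdfminer.six`; the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "pdfminer.six" is the distribution name on PyPI; the import name is "pdfminer".
print(installed_version("pdfminer.six"))
```

If this prints `None`, the package is not installed in the interpreter that is running the script, which is worth ruling out when PyCharm uses a different environment than the one `pip` installed into.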

I haven't tested it intensively yet, but I could run the code in my other answer ("Improved solution (Dez 2016)") for the conversions pdf→text and pdf→html.

pyano
  • also, pdfminer.six seems maintained up to november 18! hooray. – benzkji Jun 20 '19 at 07:45
  • @pyano You mention "could run the following code for the conversion pdf→text and pdf→html" BUT there is NO code following. Working code example in May 2020 is here: https://stackoverflow.com/a/61857301/7483211 – Cornelius Roemer May 17 '20 at 19:23
  • @Cornelius Roemer The code is just below in the next answer: Improved solution (Dez 2016) – pyano Jun 10 '20 at 15:48
13

Improved solution (Dez 2016)

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter,TextConverter,XMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import io

def convert(case, fname, pages=None):
    # An empty set of page numbers means "all pages".
    pagenums = set(pages) if pages else set()
    manager = PDFResourceManager()
    codec = 'utf-8'
    caching = True

    if case == 'text':
        # Text output is collected as str.
        output = io.StringIO()
        converter = TextConverter(manager, output, codec=codec, laparams=LAParams())
    elif case == 'HTML':
        # HTML output is encoded with `codec`, so it needs a bytes buffer.
        output = io.BytesIO()
        converter = HTMLConverter(manager, output, codec=codec, laparams=LAParams())
    else:
        raise ValueError("case must be 'text' or 'HTML'")

    interpreter = PDFPageInterpreter(manager, converter)

    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    # Close the converter before reading the buffer so that anything it
    # writes on close (e.g. the HTML footer) is included in the result.
    converter.close()
    convertedPDF = output.getvalue()
    output.close()
    return convertedPDF

#//////////// main ///////////////////////
filePDF  = 'myDir//myPDF.pdf'     # input
fileHTML = 'myDir//myHTML.html'   # output
fileTXT  = 'myDir//myTXT.txt'     # output

case = "HTML"

if case == 'HTML':
    convertedPDF = convert('HTML', filePDF, pages=[0,1])
    # The HTML result is bytes; binary mode takes no encoding argument.
    fileConverted = open(fileHTML, "wb")
elif case == 'text':
    convertedPDF = convert('text', filePDF, pages=[0,1])
    fileConverted = open(fileTXT, "w", encoding="utf-8")

fileConverted.write(convertedPDF)
fileConverted.close()
#print(convertedPDF)
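
The code above uses `io.StringIO` for the text case and `io.BytesIO` for the HTML case. As a standard-library-only illustration of why the two are not interchangeable (no pdfminer involved; the variable names are just for this sketch):

```python
import io

text_buffer = io.StringIO()   # accepts str only
byte_buffer = io.BytesIO()    # accepts bytes only

text_buffer.write("caf\u00e9")
byte_buffer.write("caf\u00e9".encode("utf-8"))

as_text = text_buffer.getvalue()    # a str
as_bytes = byte_buffer.getvalue()   # a bytes object

# Writing a str into a bytes buffer fails immediately with a TypeError.
try:
    io.BytesIO().write("plain str")
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(type(as_text).__name__, type(as_bytes).__name__, mixed_ok)
```

A bytes result has to be decoded (for example `as_bytes.decode("utf-8")`) before it can be treated as ordinary text, and it has to be written to a file opened in binary mode.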
Apache
pyano
2

I tried it on Python 3.7 (with pdfminer.six installed) and it worked like a charm for me!

Here is the code I used:

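from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path_to_file):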
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path_to_file, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
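
For plain text extraction, the comments on the question point to `extract_text` from `pdfminer.high_level` as a shorter route. A hedged sketch of wrapping it, assuming a pdfminer.six release recent enough to ship `pdfminer.high_level`; the wrapper name `pdf_to_text` and the up-front path check are additions for illustration, not part of the library:

```python
from pathlib import Path


def pdf_to_text(path):
    """Return the text of a PDF using pdfminer.six's high-level helper."""
    pdf_path = Path(path)
    if not pdf_path.is_file():
        raise FileNotFoundError(f"no such PDF: {pdf_path}")
    # Imported here so the path check works even if pdfminer.six is missing.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))
```

With the question's file this would be called as `text = pdf_to_text("A1.pdf")`, followed by `print(text)`.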
Muhammad Haseeb
  • Which package did you install? The classic pdfminer states it doesn't support python3? https://euske.github.io/pdfminer/index.html – benzkji Jun 20 '19 at 07:42
  • @benzkji You are right, you need to install pdfminer.six for Python 3. More details here https://github.com/pdfminer/pdfminer.six – Muhammad Haseeb Jun 20 '19 at 14:56
0

The function file() was a built-in function in Python 2.7, but it no longer exists in Python 3, which is why the traceback ends in a NameError.

You should change file() into open().
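
A small sketch of the replacement, using a throwaway temporary file so it runs anywhere (the file contents and variable names are made up for the example):

```python
import os
import tempfile

# Create a small throwaway binary file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(b"%PDF-1.4 demo bytes")
    demo_path = tmp.name

# Python 2:  infile = file(demo_path, 'rb')
# Python 3:  use open(); the with-block closes the file automatically.
with open(demo_path, "rb") as infile:
    header = infile.read(8)

os.remove(demo_path)
print(header)
```

In the question's script the equivalent change is replacing `file(fname, 'rb')` with `open(fname, 'rb')`.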

Pieter
-2

The original pdfminer doesn't support Python 3.5. It works only with Python 2 (version 2.6 or newer). I faced the same issue; try using Python 2.6 and it will solve your problem.

animal