Pdfminer python 3.5

Question

I have followed a few tutorials, but I cannot get this code block to run. I made what I believe are the necessary switches from StringIO to BytesIO.

I am unsure why 'banana' prints nothing. Could the errors be red herrings? Or is the problem that I am following a Python 2.7 tutorial and trying to translate it to Python 3?

Errors:

  File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 28, in <module>
    banana = convert("A1.pdf")
  File "/Users/foo/PycharmProjects/Try/Pdfminer.py", line 19, in convert
    infile = file(fname, 'rb')
NameError: name 'file' is not defined

Script:

from io import BytesIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = BytesIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close
    return text

banana = convert("A1.pdf")
print(banana)

The same thing happens with this variant:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

Banana = convert_pdf_to_txt("A1.pdf")
print(Banana)

I have tried searching for this (most of the pdfminer code is from this or this), but have had no luck.

Any insight is appreciated.

Cheers

Check out the answer below, which works as of May 2020 and is very simple: `from pdfminer.high_level import extract_text`, then `text = extract_text('report.pdf')`. stackoverflow.com/a/61857301/7483211 – Cornelius Roemer, May 19 '20 at 13:40
`file()` is replaced by `open()` in Python 3. See my answer below: https://stackoverflow.com/a/69962200/4054971 — Pieter, Nov 14 '21 at 10:38

pyano · Accepted Answer · 2016-12-06T09:53:36.853

36

There is a solution for Python 3.5: you need pdfminer.six. On Windows 10 I could easily install it with

pip install pdfminer.six

You can check the installed version with

pdfminer.__version__

I haven't tested it intensively yet. But I could run the following code for the conversion pdf→text and pdf→html

edited Dec 06 '16 at 09:53

answered Nov 29 '16 at 22:43

pyano


also, pdfminer.six seems maintained up to november 18! hooray. – benzkji Jun 20 '19 at 07:45
@pyano You mention "could run the following code for the conversion pdf→text and pdf→html" BUT there is NO code following. Working code example in May 2020 is here: https://stackoverflow.com/a/61857301/7483211 – Cornelius Roemer May 17 '20 at 19:23
@Cornelius Roemer The code is just below in the next answer: Improved solution (Dez 2016) – pyano Jun 10 '20 at 15:48

score 13 · Answer 2 · edited Feb 20 '20 at 07:26

Improved solution (Dez 2016)

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import io

def convert(case, fname, pages=None):
    # An empty set makes PDFPage.get_pages process every page
    pagenums = set(pages) if pages else set()
    manager = PDFResourceManager()
    codec = 'utf-8'
    caching = True

    if case == 'text':
        # Text output is collected as str
        output = io.StringIO()
        converter = TextConverter(manager, output, codec=codec, laparams=LAParams())
    elif case == 'HTML':
        # HTML output is encoded with `codec`, so it needs a bytes buffer
        output = io.BytesIO()
        converter = HTMLConverter(manager, output, codec=codec, laparams=LAParams())
    else:
        raise ValueError("case must be 'text' or 'HTML'")

    interpreter = PDFPageInterpreter(manager, converter)

    # open() replaces the Python 2 built-in file()
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums, caching=caching, check_extractable=True):
            interpreter.process_page(page)

    convertedPDF = output.getvalue()

    converter.close()
    output.close()
    return convertedPDF

#//////////// main ///////////////////////
filePDF  = 'myDir//myPDF.pdf'     # input
fileHTML = 'myDir//myHTML.html'   # output
fileTXT  = 'myDir//myTXT.txt'     # output

case = "HTML"

if case == 'HTML':
    convertedPDF = convert('HTML', filePDF, pages=[0, 1])
    # bytes result: binary mode, which does not accept an encoding argument
    fileConverted = open(fileHTML, "wb")
elif case == 'text':
    convertedPDF = convert('text', filePDF, pages=[0, 1])
    fileConverted = open(fileTXT, "w", encoding="utf-8")
else:
    raise ValueError("case must be 'text' or 'HTML'")

fileConverted.write(convertedPDF)
fileConverted.close()
#print(convertedPDF)
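
The split between io.StringIO for text and io.BytesIO for HTML in the code above comes down to Python 3's distinction between str and bytes. The following is a stdlib-only sketch (no pdfminer involved, and the sample strings are made up for illustration) of what each buffer accepts and returns:

```python
import io

# StringIO accepts str and returns str.
text_buf = io.StringIO()
text_buf.write("caf\u00e9")
text_value = text_buf.getvalue()

# BytesIO accepts bytes and returns bytes.
byte_buf = io.BytesIO()
byte_buf.write("caf\u00e9".encode("utf-8"))
byte_value = byte_buf.getvalue()

# Writing a str into a BytesIO fails immediately with TypeError.
try:
    io.BytesIO().write("plain str")
    wrong_type_rejected = False
except TypeError:
    wrong_type_rejected = True

# bytes have to be decoded before they can be treated as text.
decoded = byte_value.decode("utf-8")

print(type(text_value).__name__, type(byte_value).__name__, decoded == text_value)
```

The same distinction applies when saving the result: a bytes value goes to a file opened with "wb" (no encoding argument), while a str value goes to a file opened with "w" and an encoding.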

score 2 · Answer 3 · answered Oct 14 '18 at 13:56

In my case, on Python 3.7, I tried it and it worked like a charm!

Here is the code I used:

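from io import StringIO

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert_pdf_to_txt(path_to_file):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()   # collects the extracted text as str
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path_to_file, 'rb')   # open() instead of the Python 2 file()
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0          # 0 means no page limit
    caching = True
    pagenos = set()       # empty set means all pages

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text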

Which package did you install? Doesn't the classic pdfminer state that it does not support Python 3? https://euske.github.io/pdfminer/index.html – benzkji, Jun 20 '19 at 07:42
@benzkji You are right, you need to install pdfminer.six for Python 3. More details here: https://github.com/pdfminer/pdfminer.six – Muhammad Haseeb, Jun 20 '19 at 14:56
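
To see which of the two packages is actually installed in a given environment, the package metadata can be queried. This is a minimal sketch, assuming Python 3.8+ (for importlib.metadata). The helper name installed_version is hypothetical, and the distribution names are the ones used with pip:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Both distributions provide the same import name, "pdfminer",
# so checking the distribution name is less ambiguous than importing.
for name in ("pdfminer", "pdfminer.six"):
    print(name, "->", installed_version(name))
```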

score 0 · Answer 4 · answered Nov 14 '21 at 10:38

file() was a built-in function in Python 2.7, but it was removed in Python 3, so it is not available in Python 3.5.

You should change file() into open().
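
A small sketch of that replacement, using only the standard library. The file name sample.bin and its contents are placeholders created just for this demonstration, not a real PDF:

```python
import builtins
import os
import tempfile

# Create a throwaway file so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.bin")
with open(path, "wb") as out:
    out.write(b"%PDF-1.4 placeholder")

# Python 3 has no built-in named file, hence the NameError.
has_file_builtin = hasattr(builtins, "file")

# Python 2:  infile = file(path, 'rb')
# Python 3:  use open(); the with-block closes the file even if an error occurs.
with open(path, "rb") as infile:
    header = infile.read(8)

print(has_file_builtin, header, infile.closed)
```

Using a with-block also removes the need for the explicit infile.close() call.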

Pieter


score -2 · Answer 5 · answered Nov 11 '16 at 14:58

pdfminer doesn't support Python 3.5; it works only on Python 2 (2.6 or newer). I faced the same issue. Try using Python 2.6, which will solve your problem.

animal


I strongly recommend to use Python 3.8+ in 2023. This answer was written in 2016 and is outdated. – Martin Thoma Mar 22 '23 at 09:03
