I want to extract text from a PDF to a .txt file using PDFMiner. I found some code, but I have no idea how to use it

Question

This is the code I found somewhere here. I have no idea how to use it. Can someone walk me through this and help me convert a sample pdf?

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

And I'm assuming you took the code from here? https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167 — glls, May 21 '16 at 21:37
I got it from here: http://stackoverflow.com/questions/26494211/extracting-text-from-a-pdf-file-using-pdfminer-in-python/26495057#26495057 — iMiner, May 21 '16 at 21:39
Quick additional comment: I'm using the Python 3 fork of PDFminer, [pdfminer.six](https://github.com/goulu/pdfminer "pdfminer.six"), and to use this code I found I had to substitute open() for file() in the convert() function. — Tom Wagstaff, Dec 02 '16 at 18:32

glls · Accepted Answer · 2016-05-21T22:21:24.650

4

If you use pdfminer, you can take the code from this tutorial and read the explanation that goes with it (https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167):

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

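I don't think you should have any trouble using convert(fname, pages=None), which converts the PDF for you and returns the text as a string. The optional pages argument restricts the conversion to the given page numbers; leave it out to convert every page.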

Use it as follows:

some_variable = convert("filename.pdf")
print(some_variable)
# do something with your variable
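Since the question asks for a .txt file rather than a printed string, the "do something" step could be sketched as below. This is a minimal Python 3 sketch, not part of the original answer. The names pdf_to_txt_file and extract are hypothetical. The default extractor assumes the pdfminer.six fork is installed and uses its high-level extract_text() helper; any other callable that takes a path and returns a str (for example a Python 3 port of convert) can be passed in instead.

```python
import io


def pdf_to_txt_file(pdf_path, txt_path, extract=None):
    """Extract text from pdf_path and write it to txt_path as UTF-8.

    `extract` is a hypothetical hook: any callable that takes a PDF path
    and returns the extracted text as a str.
    """
    if extract is None:
        # Assumption: pdfminer.six is installed (pip install pdfminer.six).
        from pdfminer.high_level import extract_text
        extract = extract_text

    text = extract(pdf_path)

    # An explicit encoding avoids depending on the platform default.
    with io.open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return text
```

With pdfminer.six installed, a call such as pdf_to_txt_file("filename.pdf", "filename.txt") would write the extracted text next to the PDF and also return it.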

Using your example PDF: [screenshot of the converted output, not preserved]

edited May 21 '16 at 22:21

answered May 21 '16 at 21:46

glls


That works... kinda. The original PDF said "This is pdf", but Python displays "ThisÂ isÂ pdfÂ ". – iMiner May 21 '16 at 21:53
Is the PDF public, as in, are you able to share it? – glls May 21 '16 at 22:05
https://drive.google.com/file/d/0B5eGq9boXZxARWJLX0pDb1RaX2s/view?usp=sharing It's on my Google Drive. I think since I have shared it you can download it. – iMiner May 21 '16 at 22:15
I can't reproduce your error; see the screenshot I recently added to my answer. I even exported to a txt file and everything looks good. – glls May 21 '16 at 22:20
It's fine now. It just bugged out for some weird reason before. Thanks! – iMiner May 21 '16 at 22:23
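
For anyone who hits the same "Â" output: one plausible cause (an assumption, since the thread never pinned it down) is that the PDF contains non-breaking spaces (U+00A0). Written out as UTF-8 they become the two bytes C2 A0, and if those bytes are then shown or decoded as Latin-1 or cp1252, the C2 byte appears as "Â". A small Python 3 sketch of that effect and how to reverse it, where fix_mojibake is a hypothetical helper name:

```python
def fix_mojibake(garbled):
    """Undo UTF-8 text that was wrongly decoded as Latin-1."""
    return garbled.encode("latin-1").decode("utf-8")


original = u"This\u00a0is\u00a0pdf"      # words separated by non-breaking spaces
utf8_bytes = original.encode("utf-8")   # each U+00A0 becomes b"\xc2\xa0"
garbled = utf8_bytes.decode("latin-1")  # wrong codec: byte 0xC2 shows up as "Â"

print(repr(garbled))
print(fix_mojibake(garbled) == original)

# If plain spaces are wanted in the output, replace the non-breaking ones.
plain = original.replace(u"\u00a0", u" ")
```

If this is the cause, the extracted text itself is fine and the fix belongs in whatever reads or displays it, by telling it the data is UTF-8.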

score -1 · Answer 2 · answered Apr 24 '17 at 07:05

I finally found a way to do this. The best library is PDFMiner, with a small modification to pdf2txt.py to make it easier to use. pdf2txt.py is located in pdfminer/tools.

To install PDFMiner, run this in a terminal:

pip install pdfminer

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import re

def convert(fname):
    pages=None
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    print text

    # write content to .txt, collapsing runs of whitespace into single spaces
    text_file = open("Output_1.txt", "w")
    text = re.sub(r"\s\s+", " ", text)
    text_file.write("%s" % text)
    text_file.close()

convert("xyz.pdf")
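
A note on what the re.sub call in this answer does to the extracted text: \s also matches newlines, so every run of two or more whitespace characters, including the blank lines between paragraphs, collapses into a single space, while a lone newline survives. A short Python 3 sketch of that behavior, plus a hypothetical alternative (collapse_spaces_keep_lines is not from the answer) for cases where the line structure should be kept:

```python
import re

sample = "Title\n\nFirst  line\nSecond\t\tline   end\n"

# Same pattern as in the answer, written as a raw string.
collapsed = re.sub(r"\s\s+", " ", sample)


def collapse_spaces_keep_lines(text):
    """Collapse runs of spaces/tabs within each line, keeping line breaks."""
    return "\n".join(re.sub(r"[ \t]+", " ", line).strip()
                     for line in text.splitlines())


print(repr(collapsed))
print(repr(collapse_spaces_keep_lines(sample)))
```

Which variant fits depends on whether the .txt output is meant for reading or for further processing as one flat string.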

Sorry but this doesn't tell us anything, and it isn't a reusable resource for other users on SO, I'm voting to close it. Can you go back and edit the question to state what the actual problem with the boilerplate code was, and how you fixed it? Also please make that clear in your answer. We're not going to do a diff between that code and this code. — smci, Apr 05 '19 at 10:45
