I found this code somewhere on this site, but I have no idea how to use it. Can someone walk me through it and help me convert a sample PDF?

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = file(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text
iMiner

2 Answers

If you use pdfminer, you can take the code from this tutorial and read the explanation that goes with it: https://www.binpress.com/tutorial/manipulating-pdfs-with-python/167

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def convert(fname, pages=None):
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    return text

I don't think you should have any trouble using it: convert(fname, pages=None) reads the PDF and returns its text as a string.

Use it as follows:

some_variable = convert("filename.pdf") 
print(some_variable)
#do something with your variable
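
The snippet above targets Python 2: cStringIO and the file() builtin no longer exist in Python 3. A rough Python 3 sketch of the same function might look like the following, assuming the pdfminer.six fork is installed (pip install pdfminer.six). The page_set helper is a hypothetical name introduced here only for illustration, and the pdfminer imports are deferred into the function so the module loads even where the library is missing.

```python
from io import StringIO


def page_set(pages=None):
    """Return the page numbers as a set; an empty set means 'all pages'."""
    return set(pages) if pages else set()


def convert(fname, pages=None):
    """Extract text from the PDF at fname (assumes pdfminer.six is installed)."""
    # Deferred imports: these need pdfminer.six at call time, not at load time.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # open() replaces the Python 2 file() builtin; 'with' closes the file for us.
    with open(fname, "rb") as infile:
        for page in PDFPage.get_pages(infile, page_set(pages)):
            interpreter.process_page(page)

    converter.close()
    text = output.getvalue()
    output.close()
    return text
```

It is called the same way as before, for example convert("filename.pdf"). Passing pages=[0] should restrict extraction to the first page, since pdfminer counts pages from zero as far as I know.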

Using your example PDF: [screenshot of the output]

glls
  • That works...kinda. This was the output: This is pdf  The original PDF said "This is pdf" but python displays "This is pdf " – iMiner May 21 '16 at 21:53
  • Is the PDF public, as in, are you able to share it? – glls May 21 '16 at 22:05
  • https://drive.google.com/file/d/0B5eGq9boXZxARWJLX0pDb1RaX2s/view?usp=sharing It's on my Google Drive. I think since I have shared it you can download it. – iMiner May 21 '16 at 22:15
  • I can't reproduce your error; see the recently added screenshot in my answer. I even exported to a txt file and everything looks good. – glls May 21 '16 at 22:20
  • It's fine now. It just bugged out for some weird reason before. Thanks! – iMiner May 21 '16 at 22:23

I finally found a way to do this. The best library I found is pdfminer, with a small modification to pdf2txt.py to make it easier to use. pdf2txt.py is located in pdfminer/tools.

To install pdfminer, run this in a terminal:

pip install pdfminer

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import re

def convert(fname):
    pages=None
    if not pages:
        pagenums = set()
    else:
        pagenums = set(pages)

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    infile = file(fname, 'rb')
    for page in PDFPage.get_pages(infile, pagenums):
        interpreter.process_page(page)
    infile.close()
    converter.close()
    text = output.getvalue()
    output.close()
    print text

    # write Content to .txt
    text_file = open("Output_1.txt", "w")
    text = re.sub("\s\s+", " ", text)
    text_file.write("%s" % text)
    text_file.close()

convert("xyz.pdf")
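
The last step above, collapsing runs of whitespace and writing the result to a file, can be separated from the PDF handling. Here is a minimal Python 3 sketch of just that step. The function names collapse_whitespace and write_text are hypothetical and chosen for illustration. The file name Output_1.txt comes from the code above.

```python
import re


def collapse_whitespace(text):
    """Replace each run of two or more whitespace characters with one space."""
    # Raw string so that \s is passed to the regex engine unchanged.
    return re.sub(r"\s\s+", " ", text)


def write_text(text, out_path="Output_1.txt"):
    """Write the whitespace-collapsed text to out_path as UTF-8."""
    # The 'with' block closes the file even if the write fails.
    with open(out_path, "w", encoding="utf-8") as text_file:
        text_file.write(collapse_whitespace(text))
```

Note that the pattern only matches runs of two or more whitespace characters, so a single newline between lines is left as it is.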
user3732708
  • Sorry but this doesn't tell us anything, and it isn't a reusable resource for other users on SO, I'm voting to close it. Can you go back and edit the question to state what the actual problem with the boilerplate code was, and how you fixed it? Also please make that clear in your answer. We're not going to do a diff between that code and this code. – smci Apr 05 '19 at 10:45