Iterate through .PDFs and convert them to .txt using PDFMiner

Question

I'm trying to merge two different things I've been able to accomplish independently. Unfortunately the PDFMiner docs are just not useful at all.

I have a folder containing hundreds of PDFs named "[0-9].pdf", in no particular order, and I don't need to sort them. I just need a way to go through them and convert them to text.

Using this post: Extracting text from a PDF file using PDFMiner in python? - I was able to extract the text from one PDF successfully.

Some of this post: batch process text to csv using python - was useful in determining how to open a folder full of PDFs and work with them.

Now I just don't know how to combine them so that I open each PDF in turn, convert it to a text object, save that to a text file named original-filename.txt, and then move on to the next PDF in the directory.

Here's my code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import os
import glob

directory = r'./Documents/003/' #path
pdfFiles = glob.glob(os.path.join(directory, '*.pdf'))

resourceManager = PDFResourceManager()
returnString = StringIO()
codec = 'utf-8'
laParams = LAParams()
device = TextConverter(resourceManager, returnString, codec=codec, laparams=laParams)
interpreter = PDFPageInterpreter(resourceManager, device)

password = ""
maxPages = 0
caching = True
pageNums=set()

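for one_pdf in pdfFiles:
    print("Processing file: " + str(one_pdf))
    fp = open(one_pdf, 'rb')
    for page in PDFPage.get_pages(fp, pageNums, maxpages=maxPages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    fp.close()
    text = returnString.getvalue()
    # Empty the shared buffer so the next PDF's output does not include this one's text
    returnString.seek(0)
    returnString.truncate()
    # Write to <original-name>.txt rather than <original-name>.pdf.txt
    filenameString = os.path.splitext(one_pdf)[0] + ".txt"
    text_file = open(filenameString, "w")
    text_file.write(text)
    text_file.close()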

device.close()
returnString.close()

I get no compilation errors, but my code doesn't do anything.

Thanks for your help!

I think `pdfFiles` is empty because I didn't see anything. But why would that be? @LaurentLAPORTE @stovfl — kabaname, May 09 '17 at 19:11
Your directory `directory = r'./Documents/003/'` doesn't exist: it is a relative path, so the result depends on where you are in your directory tree when you invoke your Python program. Use an absolute path. — Laurent LAPORTE, May 09 '17 at 19:14
That worked! I used `os.path.abspath("../Documents/003/")` and that worked. Thanks!! @LaurentLAPORTE — kabaname, May 09 '17 at 19:27

score 1 · Answer 1 · answered May 09 '17 at 19:28

Just answering my own question with the solution idea from @LaurentLAPORTE that worked.

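Set directory to an absolute path using os.path.abspath, like this: os.path.abspath("../Documents/003/"). After that, glob finds the PDFs and the loop runs. Keep in mind that abspath resolves the relative part against the current working directory, so the script still has to be started from the same folder each time.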
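
For anyone who hits the same silent failure, here is a minimal sketch of how the path handling could be made to fail loudly instead of quietly producing an empty list. The helper names find_pdfs and txt_path_for are made up for this illustration and are not part of PDFMiner; only os and glob from the standard library are used.

```python
import glob
import os


def find_pdfs(directory):
    """Hypothetical helper: list the PDFs in `directory`, or raise if the folder is missing."""
    # abspath resolves a relative path against the current working directory
    directory = os.path.abspath(directory)
    if not os.path.isdir(directory):
        # glob would silently return [] here, which is why the loop never ran
        raise IOError("No such directory: " + directory)
    return glob.glob(os.path.join(directory, "*.pdf"))


def txt_path_for(pdf_path):
    """Hypothetical helper: map some/dir/12.pdf to some/dir/12.txt."""
    return os.path.splitext(pdf_path)[0] + ".txt"
```

Calling find_pdfs("../Documents/003/") in place of the bare glob call would have raised an error naming the resolved path, which makes a wrong working directory obvious right away. txt_path_for can then supply the output name for each PDF inside the loop.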

kabaname
