I'm trying to merge two different things I've been able to accomplish independently. Unfortunately the PDFMiner docs are just not useful at all.

I have a folder containing hundreds of PDFs named "[0-9].pdf", in no particular order, and I don't care to sort them. I just need a way to go through them and convert each one to text.

Using this post: Extracting text from a PDF file using PDFMiner in python? - I was able to extract the text from one PDF successfully.

Some of this post: batch process text to csv using python - was useful in determining how to open a folder full of PDFs and work with them.

Now I just don't know how to combine them: open each PDF one by one, convert it to a text object, save that to a text file named original-filename.txt, and then move on to the next PDF in the directory.

Here's my code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import os
import glob

directory = r'./Documents/003/' #path
pdfFiles = glob.glob(os.path.join(directory, '*.pdf'))

resourceManager = PDFResourceManager()
returnString = StringIO()
codec = 'utf-8'
laParams = LAParams()
device = TextConverter(resourceManager, returnString, codec=codec, laparams=laParams)
interpreter = PDFPageInterpreter(resourceManager, device)

password = ""
maxPages = 0
caching = True
pageNums = set()

for one_pdf in pdfFiles:
    print("Processing file: " + str(one_pdf))
    fp = file(one_pdf, 'rb')
    for page in PDFPage.get_pages(fp, pageNums, maxpages=maxPages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = returnString.getvalue()
    filenameString = str(one_pdf) + ".txt"
    text_file = open(filenameString, "w")
    text_file.write(text)
    text_file.close()
    fp.close()

device.close()
returnString.close()

The script runs without any errors, but it doesn't do anything.

Thanks for your help!

kabaname

1 Answer

Just answering my own question with the solution idea from @LaurentLAPORTE that worked.

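Set directory to an absolute path using os, like this: os.path.abspath("../Documents/003/"). Then it works. When the path doesn't resolve to the folder holding the PDFs, glob.glob returns an empty list instead of raising an error, so the loop never runs and the script appears to do nothing.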
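
For anyone landing here later, this is a minimal Python 3 sketch of the whole loop, not the exact code from the question (which is Python 2). The helper names find_pdfs, text_path_for and convert_all are made up for illustration, and the extraction step is passed in as a function so the folder handling stays separate from PDFMiner. It fails loudly when the directory is missing rather than silently matching nothing, and it extracts each file on its own, so text from one PDF shouldn't leak into the next (the single shared StringIO in the question would keep accumulating). Note that os.path.abspath still resolves a relative path against the current working directory. It writes 7.txt next to 7.pdf, which is an assumption about the desired naming.

```python
import glob
import os


def find_pdfs(directory):
    """Return the sorted absolute paths of the .pdf files in `directory`."""
    absolute_dir = os.path.abspath(directory)
    # glob would just return [] for a bad path, so check explicitly.
    if not os.path.isdir(absolute_dir):
        raise FileNotFoundError("No such directory: " + absolute_dir)
    return sorted(glob.glob(os.path.join(absolute_dir, "*.pdf")))


def text_path_for(pdf_path):
    """Swap the .pdf extension for .txt, keeping the same folder and name."""
    return os.path.splitext(pdf_path)[0] + ".txt"


def convert_all(directory, extract_text):
    """Write one .txt file per PDF and return the paths that were written.

    `extract_text` is any function that takes a PDF path and returns a string.
    """
    written = []
    for pdf_path in find_pdfs(directory):
        print("Processing file: " + pdf_path)
        out_path = text_path_for(pdf_path)
        with open(out_path, "w", encoding="utf-8") as out_file:
            out_file.write(extract_text(pdf_path))
        written.append(out_path)
    return written
```

If pdfminer.six is installed (an assumption, it is not what the question used), its pdfminer.high_level.extract_text function takes a PDF path and returns the text, so something like convert_all("../Documents/003/", extract_text) should be enough to run it over the folder.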

kabaname