I have used PDFMiner successfully; it can parse PDF documents and extract their text.
More specifically, it ships with a pdf2txt.py script that you can use to extract text.
Installation is easy: from the unpacked pdfminer-xxx directory, run python setup.py install.
Then, from bash or cmd, a simple pdf2txt.py -o Application.txt Reference/Application.pdf
command does the trick.
In the one-liner above, Reference/Application.pdf
is your target PDF, the one you are going to process, and Application.txt
is the file that will be generated.
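If you have a whole directory of PDFs, the same command can be repeated for each file. Here is a minimal sketch that only builds the command lines; build_pdf2txt_commands is a hypothetical helper name of mine, not part of PDFMiner, and it assumes pdf2txt.py is on your PATH.

```python
import os


def build_pdf2txt_commands(pdf_dir, out_dir):
    """Return one pdf2txt.py argument list per PDF found in pdf_dir.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    commands = []
    for name in sorted(os.listdir(pdf_dir)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # ignore anything that is not a PDF
        commands.append([
            "pdf2txt.py",
            "-o", os.path.join(out_dir, base + ".txt"),  # generated text file
            os.path.join(pdf_dir, name),                 # target PDF
        ])
    return commands
```

Each returned list can be passed to subprocess.check_call to actually run the conversion. On Windows you may need to put the Python interpreter in front of pdf2txt.py, depending on how the script was installed.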
For more complex tasks, you can take a look at the API and adapt it to your needs.
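As a rough idea of what calling the API from Python can look like, here is a sketch under one loud assumption: it relies on the extract_text helper from pdfminer.high_level, which exists in the maintained pdfminer.six fork (Python 3) and may not be available in the older PDFMiner release you install with setup.py. pdf_to_text and its extract parameter are names I made up for illustration.

```python
def pdf_to_text(pdf_path, txt_path, extract=None):
    """Write the text of pdf_path to txt_path and return the character count.

    `extract` is a hypothetical hook: any callable that takes a PDF path and
    returns a string. It defaults to pdfminer.six's extract_text.
    """
    if extract is None:
        # Assumption: pdfminer.six is installed; older PDFMiner lacks this helper.
        from pdfminer.high_level import extract_text as extract
    text = extract(pdf_path)
    with open(txt_path, "w", encoding="utf-8") as out:
        out.write(text)
    return len(text)
```

With pdfminer.six installed, pdf_to_text("Reference/Application.pdf", "Application.txt") should produce roughly what the pdf2txt.py one-liner does. Passing your own extract callable lets you swap in a different extraction routine.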
Edit: I answered based on my personal experience, and that's that. I have no reason to "promote" the proposed tool. I hope that helps.
Edit 2: something like this worked for me. It concatenates every file in a directory into a single text file.
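# -*- coding: utf-8 -*-
import os

dirpath = 'path\\to\\dir'
outpath = os.path.join(dirpath, 'file.txt')
# List the directory before the output file is created.
filenames = os.listdir(dirpath)
nb = 0
# Concatenate every file in dirpath into a single text file.
with open(outpath, 'w') as outfile:
    for fname in filenames:
        currentfile = os.path.join(dirpath, fname)
        # Skip subdirectories and an output file left over from an earlier run.
        if currentfile == outpath or not os.path.isfile(currentfile):
            continue
        nb = nb + 1
        print(fname)
        print(nb)
        with open(currentfile) as infile:
            for line in infile:
                outfile.write(line)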