How to add more tolerance for whitespaces in PyPDF2?

Question

I'm looking for the easiest way to convert PDF to plain text in Python.

PyPDF2 seemed very easy to use; here is what I have:

def test_pdf(filename):
    import PyPDF2
    # Open the PDF in binary mode and print the text extracted from each page
    pdf = PyPDF2.PdfFileReader(open(filename, "rb"))
    for page in pdf.pages:
        print(page.extractText())

But it gives me:

InChapter5wepresentandevaluateourresults,togetherwiththetestenvironment.

How can I extract the words from that PDF with PyPDF2? Or is there a different way (another library that works well for this)?

score 0 · Answer 1 · 2014-02-10T16:47:36.317

I have used PDFMiner successfully; it can parse PDF documents and extract their text. In particular, it ships with a pdf2txt.py script that you can use to extract text. Installation is easy: from the pdfminer-xxx directory, run python setup.py install. Then, from bash or cmd, a simple pdf2txt.py -o Application.txt Reference/Application.pdf will do the trick. In that one-liner, Reference/Application.pdf is your target PDF, the one you are going to process, and Application.txt is the file that will be generated. For more complex tasks, you can take a look at the API and adapt it to your needs.

edit: I answered based on my personal experience, and that's that. I have no reason to "promote" the proposed tool. I hope that helps.

edit2: Something like this worked for me.

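# -*- coding: utf-8 -*-
import os

dirpath = 'path\\to\\dir'
# List the directory before the output file is created
filenames = os.listdir(dirpath)
nb = 0

# Concatenate the contents of every file in dirpath into one output file
with open('path\\to\\dir\\file.txt', 'w') as outfile:
    for fname in filenames:
        nb = nb + 1
        print(fname)
        print(nb)
        currentfile = os.path.join(dirpath, fname)
        # Copy the current file line by line into the output file
        with open(currentfile) as infile:
            for line in infile:
                outfile.write(line)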

Thanks for your recommendation. I tried PDFMiner's API and got stuck with weird Character Item Objects. Using the command line tool within a subprocess call feels a bit strange, would that be a good way to go? — kadrian, Feb 10 '14 at 07:42
@kadrian I edited my answer to show something more generic. Say, for example, you need to parse a file that sits in a directory along with other PDF files, or at some point you need to parse them all one by one and extract text. With some slight modifications it may cover your needs. In any case, if you have any questions, feel free to ask. — , Feb 10 '14 at 16:51
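
On the subprocess question raised in the comments, here is a minimal sketch of wrapping the same pdf2txt.py -o command from Python. It assumes pdf2txt.py is on your PATH and directly executable; the helper names build_pdf2txt_command and pdf_to_text are made up for illustration and are not part of PDFMiner.

```python
import subprocess


def build_pdf2txt_command(pdf_path, txt_path, script="pdf2txt.py"):
    """Return the argument list for: pdf2txt.py -o <txt_path> <pdf_path>."""
    # Assumption: `script` names PDFMiner's pdf2txt.py and can be found on PATH.
    return [script, "-o", txt_path, pdf_path]


def pdf_to_text(pdf_path, txt_path, script="pdf2txt.py"):
    """Run pdf2txt.py and return the path of the generated text file."""
    # check_call raises CalledProcessError if the tool exits with a non-zero status.
    subprocess.check_call(build_pdf2txt_command(pdf_path, txt_path, script))
    return txt_path
```

Calling pdf_to_text("Reference/Application.pdf", "Application.txt") would then be equivalent to the one-liner in the answer. Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.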
