Creating .txt files from pdf files

Question

Right now I'm writing a program in Python that requires you to open a certain .pdf file, press Ctrl+A (select all), then Ctrl+C and Ctrl+V (copy and paste) into a .txt file, and only then run the program.

I was wondering if there's any way to skip these manual steps and run the program with just a reference to the PDF file inside it.

Something like:

# does the procedure above and saves the result to notes.txt
FILE_NAME = 'notes.pdf'
read_pdf(FILE_NAME,'notes.txt')

Try the code here maybe: http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/ — TYZ, Feb 13 '14 at 19:48
There are certain utilities such as `pdftotext`. You might want to explore those. — devnull, Feb 13 '14 at 19:48
+1 for `pdftotext`. It's very convenient. You're likely going to have to do some preprocessing on the text, though (in particular if the text contains non-ASCII characters). — michaelmeyer, Feb 13 '14 at 19:56
http://stackoverflow.com/questions/25665/python-module-for-converting-pdf-to-text — Omid Raha, Feb 13 '14 at 20:48

Omid Raha · Answer 1 · 2014-02-13T21:59:43.337

Use the slate module, which depends on pdfminer.

To install it:

pip install pdfminer==20131113
pip install https://codeload.github.com/timClicks/slate/zip/master

To use it:

import slate

with open('example.pdf', 'rb') as fp:  # PDFs are binary files, so open in binary mode
    doc = slate.PDF(fp)

print(len(doc))
print(doc[0])

4
This is a test.

Notes:

The pdfminer module (at the version pinned above) doesn't support Python 3.
You need to install slate from the master branch of its repository, because the PyPI version of slate is old and is not compatible with the latest changes in pdfminer.

Or use PyPDF2:

To install it:

pip install PyPDF2

To use it:

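import PyPDF2

# Open in binary mode; the with-block closes the file when done.
# Note: PyPDF2 3.x renamed these calls to PdfReader, len(reader.pages)
# and page.extract_text().
with open('sample.pdf', 'rb') as fp:
    pdf = PyPDF2.PdfFileReader(fp)
    print(pdf.getNumPages())
    print(pdf.getPage(0).extractText())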

1
This is a sample.
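
The two-argument `read_pdf(FILE_NAME, 'notes.txt')` call from the question could be sketched on top of PyPDF2 roughly as follows. This is a minimal sketch under assumptions, not a tested drop-in: it assumes Python 3 and a recent PyPDF2 release that provides `PdfReader` and `extract_text()` (older releases use `PdfFileReader` and `extractText()` instead), and the helper name `write_pages` is made up for illustration.

```python
def write_pages(page_texts, txt_path):
    """Write each page's text to txt_path, followed by a blank line.

    Hypothetical helper; returns the number of pages written.
    """
    count = 0
    with open(txt_path, 'w', encoding='utf-8') as out:
        for text in page_texts:
            # Extraction may yield None or '' for pages without text.
            out.write((text or '') + '\n\n')
            count += 1
    return count


def read_pdf(pdf_path, txt_path):
    """Extract the text of every page in pdf_path and save it to txt_path."""
    # Imported here so the module still loads when PyPDF2 is missing;
    # assumes a PyPDF2 version that provides PdfReader.
    from PyPDF2 import PdfReader

    with open(pdf_path, 'rb') as fp:
        reader = PdfReader(fp)
        # The generator is consumed inside the with-block, while the file is open.
        return write_pages((page.extract_text() for page in reader.pages), txt_path)


# Usage, as in the question: read_pdf('notes.pdf', 'notes.txt')
```

How close the result is to a manual copy and paste depends on the PDF; scanned or unusually encoded documents may produce little or no text.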

score 1 · Answer 2 · answered Feb 13 '14 at 20:25

There are several ways to do that step automatically, and many utilities you can use.

There is a Python module for GUI automation, pywinauto, but it is Windows-only.

You can use a pure-Python library such as PyPDF2, which has an extractText function, or PDFMiner.

The poppler library also has Python bindings and can be used to extract text much like PyPDF2.

You can also call external programs from Python, such as pdftotext from Xpdf.
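
Calling pdftotext from Python can be sketched with the standard `subprocess` module. This assumes the `pdftotext` executable is installed and on the PATH; the function names are made up for illustration, and `-layout` is an optional pdftotext flag that asks it to keep the original physical layout of the page.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Build the argument list for: pdftotext [-layout] PDF-file text-file."""
    cmd = ['pdftotext']
    if keep_layout:
        cmd.append('-layout')
    cmd += [pdf_path, txt_path]
    return cmd


def pdf_to_text(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext on pdf_path and write the result to txt_path.

    Raises subprocess.CalledProcessError if pdftotext exits with an error,
    and OSError if the executable cannot be found.
    """
    subprocess.check_call(build_pdftotext_command(pdf_path, txt_path, keep_layout))


# Usage: pdf_to_text('notes.pdf', 'notes.txt')
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with file names that contain spaces.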
