How to create a text file from pdf using Python?

Question

I am trying to write a block of code that does this: it first extracts text from a pdf and then creates a text file with the content in it. This is what I wrote:

import os
import pyPdf
import re

##function that extracts text from pdf
def pdfcontent(filename):
    ct = ""
    pdf = pyPdf.PdfFileReader(file(filename,"rb"))
    for i in range(0,pdf.getNumPages()):
        ct += pdf.getPage(i).extractText() + "\n"
    return ct

##function that generates a txt file from a pdf
def pdftotxt(filename):
    ##first, convert pdf to txt
    pdfct = pdfcontent(filename)
    ##fix filename problem
    newfn = re.sub(".pdf", "", filename)
    #now generate txt
    fo = open(r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt\' + newfn + ".txt","wb")
    fo.write(pdfct)
    fo.close()

pdftotxt("PDFfromDocumentum.pdf")

EDIT: I fixed my previous problems and then another problem came up:

File "C:/Users/xxx/PycharmProjects/untitled/fdsa", line 22
fo = open(r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt\' + newfn + ".txt","wb")
                                                                                      ^
SyntaxError: EOL while scanning string literal

It seems to me that Python took

fo = open(r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt\' + newfn + ".txt","wb")

as a string instead of a command. What's the solution to this problem?

Which file/directory doesn't exist? Are you sure it's not the filename you feed to `PdfFileReader`? Please post the actual traceback. — wflynny, Jul 15 '14 at 19:32
Seems you are able to solve your problems within not very long time frame. That's very good, and good luck, but people on the internet are probably not interested in a live report of your programming struggle. Please consider posting a question when you are really stuck (and unable to find the solution on SO), instead of editing it every few minutes with your latest achievements... — BartoszKP, Jul 15 '14 at 19:40
Duplicate of http://stackoverflow.com/questions/2870730/python-raw-strings-and-trailing-backslash — BartoszKP, Jul 15 '14 at 19:46

score 0 · Answer 1 · answered Jul 15 '14 at 19:41

If you want your script to create a new file if it does not exist use "wb" as the mode.

Refer to this for more information on using file modes.
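As a rough illustration of the "wb" behaviour described above, here is a minimal sketch. The temporary directory and the file name demo.txt are made up for the demo. Note that on Python 3 a file opened in binary mode expects bytes, so text has to be encoded first (the question's code is Python 2, where this distinction does not come up).

```python
import os
import tempfile

out_dir = tempfile.mkdtemp()               # throwaway directory for the demo
path = os.path.join(out_dir, "demo.txt")   # hypothetical file name

existed_before = os.path.exists(path)

# "wb" creates the file if it is missing and truncates it if it already exists.
with open(path, "wb") as fo:
    fo.write(u"extracted text\n".encode("utf-8"))  # binary mode expects bytes

# Read the file back to confirm what was written.
with open(path, "rb") as fi:
    data = fi.read()

print(existed_before, os.path.exists(path))
```

If the content is plain text, opening with mode "w" and an explicit encoding is probably the simpler choice on Python 3.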

EDIT (based on your edit)

You are getting "EOL while scanning string literal" because the backslash before the closing apostrophe (\') escapes the apostrophe, so the string literal never ends. Escape that backslash with another backslash, i.e. \\'
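A small sketch of that rule, using the built-in compile() so the broken literal can be tested without breaking the script itself. The short path C:\dir and the helper name compiles are placeholders for illustration only.

```python
# Source text of two string literals, held as data so this script itself parses.
broken_src = "r'C:\\dir\\'"     # the characters: r'C:\dir\'
fixed_src = "r'C:\\dir\\\\'"    # the characters: r'C:\dir\\'

def compiles(src):
    """Hypothetical helper: True if src is a valid Python expression."""
    try:
        compile(src, "<example>", "eval")
        return True
    except SyntaxError:
        return False

broken_ok = compiles(broken_src)   # the trailing \' swallows the closing quote
fixed_ok = compiles(fixed_src)     # the doubled backslash lets the quote close
fixed_value = eval(fixed_src)      # safe here: the input is a fixed literal

print(broken_ok, fixed_ok, repr(fixed_value))
```

One thing to be aware of: in a raw string the doubled backslash is not collapsed, so the resulting value really ends with two backslashes. Windows generally tolerates a doubled separator in the middle of a path, which is why the fix works in practice.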

score 0 · Answer 2 · answered Jul 15 '14 at 19:44

Even though you are using a raw string, you still have to escape the last \

open(r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt\\' + newfn + ".txt","wb")

See "Python raw strings and trailing backslash" for details.
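An alternative that sidesteps the trailing backslash entirely is to let the path module insert the separator. This is only a sketch: txt_path_for is a hypothetical helper, not something from the question, and ntpath (the module that os.path refers to on Windows) is used directly so the example behaves the same on any OS. On a Windows machine, os.path.join and os.path.splitext would do the same job.

```python
import ntpath  # Windows flavour of os.path; on Windows, os.path is this module

def txt_path_for(pdf_name, out_dir):
    """Hypothetical helper: build the .txt output path for a given .pdf name."""
    base, _ext = ntpath.splitext(pdf_name)      # drops only the real extension
    return ntpath.join(out_dir, base + ".txt")  # join adds the separator itself

# No trailing backslash needed, so the raw string is a valid literal.
out_dir = r'C:\Users\xxx\PycharmProjects\untitled\decisiontxt'
result = txt_path_for("PDFfromDocumentum.pdf", out_dir)
print(result)
```

Using splitext here also avoids the unescaped dot in re.sub(".pdf", "", filename), which would match any character followed by "pdf" anywhere in the name.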

RomanHotsiy
