How can I copy all PDF pages into a TXT file in Python?

Question

I have written the following script, in order to extract the text of a PDF file into plain text and save it into a TXT file:

import PyPDF2

def pdfToTxt(pdfFile):
   pdfFileObject = open(pdfFile, 'rb')
   pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
   numberOfPages = pdfReader.numPages

   tempFile = open(r"temp.txt","a")

   for p in range(numberOfPages):
      pagesObject = pdfReader.getPage(p)
      text = pagesObject.extractText()
      tempFile.writelines(text)

   tempFile.close()

pdfToTxt("PdfFile.pdf")

The code works fine for the first 15 pages, which are successfully written to temp.txt, but after the 15th page I get the following error:

Traceback (most recent call last):
  File "PdfToTextExtractor.py", line 35, in <module>
    pdfToTxt("PdfFile.pdf")
  File "PdfToTextExtractor.py", line 30, in pdfToTxt
    tempFile.writelines(text)
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 0: character maps to <undefined>

It seems that the character '\ufb01' is the problem.

In case you have any idea how to overcome this issue, please let me know.

Does this answer your question? [python 3.2 UnicodeEncodeError: 'charmap' codec can't encode character '\u2013' in position 9629: character maps to ](https://stackoverflow.com/questions/16346914/python-3-2-unicodeencodeerror-charmap-codec-cant-encode-character-u2013-i) — Joe, Jun 16 '20 at 12:20

score 0 · Answer 1 · answered Jun 16 '20 at 11:49

To work around this issue, replace the offending character with another one (say, a space) before writing the text to the file.

To do that, add the following line inside the for loop:

text = text.replace('\ufb01', " ")

The function should then look like this:

def pdfToTxt(pdfFile):
   pdfFileObject = open(pdfFile, 'rb')
   pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
   numberOfPages = pdfReader.numPages

   tempFile = open(r"temp.txt","a")

   for p in range(numberOfPages):
      pagesObject = pdfReader.getPage(p)
      text = pagesObject.extractText()
      text = text.replace('\ufb01', " ")
      tempFile.writelines(text)

   tempFile.close()
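
Replacing '\ufb01' with a space drops two letters from the text, because that code point is the single-character "fi" ligature. As a possible alternative, here is a minimal sketch that expands the ligature back into plain letters using Unicode compatibility normalization from the standard library. The helper name expand_ligatures and the sample string are made up for illustration; they are not part of the original script.

```python
import unicodedata

def expand_ligatures(text):
    # NFKC compatibility normalization maps U+FB01 to the two letters "fi"
    return unicodedata.normalize("NFKC", text)

# Hypothetical sample: the word "define" spelled with the ligature character
cleaned = expand_ligatures("de\ufb01ne")
print(cleaned)
```

Note that NFKC also rewrites other compatibility characters (full-width forms, superscript digits and so on), and the normalized text can still contain characters that cp1252 cannot encode, so this complements rather than replaces setting an explicit file encoding.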

score 0 · Answer 2 · answered Jun 16 '20 at 11:50

When opening your tempFile, set the encoding like so:

tempFile = open(r"temp.txt","a", encoding='utf-8')
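
To illustrate why this helps, the short sketch below encodes the character from the traceback with both codecs. It assumes, as the traceback suggests, that the default text encoding on the asker's machine is cp1252; the variable names are only for illustration.

```python
ligature = "\ufb01"  # the character reported in the traceback

# cp1252 has no mapping for this character, so strict encoding fails
try:
    ligature.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent the same character as a three-byte sequence
utf8_bytes = ligature.encode("utf-8")
print(cp1252_ok, utf8_bytes)
```

When open() is called in text mode without an encoding argument, Python uses a platform-dependent default, which is why the same script can succeed on one machine and fail on another.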

Lucan


score 0 · Answer 3 · answered Jun 16 '20 at 11:52

The issue is in the way you open the file, so replace

tempFile = open(r"temp.txt","a")

with the same call plus an extra encoding parameter:

tempFile = open(r"temp.txt","a", encoding="utf-8")

Additionally, I suggest using a context manager for any file operations. It ensures that the file is closed correctly even if an unexpected exception is raised:

with open(r"temp.txt","a", encoding="utf-8") as tempFile:
    ...

If you do so, you can remove the tempFile.close() call after the for loop.
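
Putting both suggestions together, a minimal sketch of the writing step might look like the following. To keep it runnable without a PDF, the hypothetical helper write_pages takes the page texts as plain strings; in the original script those strings would come from the extractText() call inside the loop. The sample data and the temporary output path are assumptions for the demo.

```python
import os
import tempfile

def write_pages(pages, out_path):
    """Append each page's text to out_path, encoded as UTF-8."""
    # The with-block closes the file even if a write raises an exception
    with open(out_path, "a", encoding="utf-8") as out_file:
        for text in pages:
            out_file.write(text)

# Stand-in for the per-page strings extracted from the PDF
sample_pages = ["page one\n", "\ufb01nal page\n"]
demo_path = os.path.join(tempfile.mkdtemp(), "temp.txt")
write_pages(sample_pages, demo_path)

# Read the file back with the same encoding to check the round trip
with open(demo_path, encoding="utf-8") as f:
    result = f.read()
```

The sketch uses write() rather than writelines(), since each page is a single string. Mode "a" is kept from the original, so repeated runs against the same path keep appending to the file.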
