I have written the following script to extract the text of a PDF file and save it as plain text in a TXT file:
import PyPDF2

def pdfToTxt(pdfFile):
    pdfFileObject = open(pdfFile, 'rb')
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
    numberOfPages = pdfReader.numPages
    tempFile = open(r"temp.txt","a")
    for p in range(numberOfPages):
        pagesObject = pdfReader.getPage(p)
        text = pagesObject.extractText()
        tempFile.writelines(text)
    tempFile.close()

pdfToTxt("PdfFile.pdf")
The code works fine for the first 15 pages, which are successfully written to the temp.txt file, but after the 15th page I get the following error:
Traceback (most recent call last):
  File "PdfToTextExtractor.py", line 35, in <module>
    pdfToTxt("PdfFile.pdf")
  File "PdfToTextExtractor.py", line 30, in pdfToTxt
    tempFile.writelines(text)
  File "C:\ProgramData\Anaconda3\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb01' in position 0: character maps to <undefined>
It seems that the character '\ufb01' (U+FB01, the "fi" ligature) is the problem: the cp1252 codec shown in the traceback cannot encode it when the text is written to the file.
Any idea how to overcome this issue?
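
For reference, here is a minimal sketch that reproduces the failure without a PDF. It assumes, as the traceback suggests, that the output file is being opened with the Windows default encoding cp1252. The sample string, the temporary path, and the variable names are made up for illustration. It also shows that passing an explicit encoding="utf-8" to open() lets the same character be written and read back.

```python
import os
import tempfile

# Hypothetical sample text starting with U+FB01, the "fi" ligature.
text = "\ufb01rst page"

# cp1252 has no mapping for U+FB01, so encoding fails just like in the traceback.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Opening the file with an explicit UTF-8 encoding avoids the platform default.
path = os.path.join(tempfile.mkdtemp(), "temp.txt")
with open(path, "a", encoding="utf-8") as out_file:
    out_file.write(text)

# Read the file back with the same encoding to confirm nothing was lost.
with open(path, encoding="utf-8") as in_file:
    round_trip = in_file.read()

print(cp1252_ok, round_trip == text)
```

If that assumption holds, the equivalent change in the script would probably be open(r"temp.txt", "a", encoding="utf-8"). I have not yet checked this against the actual PDF.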