
Goal

To convert a PDF file that contains some Arabic text into a UTF-8 text file in Python using pyPdf.

Code

What I have tried:

import pyPdf
import codecs
input_filepath = "hans_wehr_searchable_pdf.pdf"#pdf file path
output_filepath = "output.txt"#output text file path
output_file = open(output_filepath, "wb")#open output file
pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))#read PDF
for page in pdf.pages:#loop through pages
    page_text = page.extractText()#get text from page
    page_text = page_text.decode(encoding='utf-8')#decode 
    output_file.write(page_text)#write to file
output_file.close()#close

Error

However, I receive this error:

Traceback (most recent call last):
  File "pdf2txt.py", line 9, in <module>
    page_text = page_text.decode(encoding='windows-1256')#decode 
  File "/usr/lib/python2.7/encodings/cp1256.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 98: ordinal not in range(128)
Martin Thoma
  • Possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – Random Davis May 24 '16 at 17:27
  • How about you extract just the code and data causing the issue and build the usual minimal but complete example with them? Consider for example whether it's relevant that the data came from a PDF or not. Also, consider upgrading to a recent Python version. – Ulrich Eckhardt May 24 '16 at 17:41

1 Answer


Instead of opening the file with Python's built-in open, you could try opening it with codecs and specifying the file's encoding when you open it; it looks like you have already imported codecs. Your code would change to:

import pyPdf
import codecs

input_filepath = "hans_wehr_searchable_pdf.pdf"  # PDF file path
output_filepath = "output.txt"  # output text file path
# codecs.open encodes unicode text to UTF-8 as it is written
output_file = codecs.open(output_filepath, "w", encoding="utf-8")
# a PDF is binary data, so read it with the built-in open in "rb" mode
pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))
for page in pdf.pages:  # loop through pages
    page_text = page.extractText()  # already a unicode string, so no decode() call is needed
    output_file.write(page_text)  # encoded to UTF-8 by the codecs wrapper
output_file.close()  # close
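
To check the writing step on its own, without a PDF, here is a minimal sketch. It assumes the extracted page text is already a unicode string, which the UnicodeEncodeError raised from a decode() call suggests. The function name write_pages_as_utf8 and the sample strings are made up for illustration; only codecs.open comes from the code above.

```python
import codecs
import os
import tempfile


def write_pages_as_utf8(page_texts, output_path):
    """Write each text string to output_path, encoded as UTF-8."""
    # codecs.open returns a wrapper that encodes text as it is written.
    with codecs.open(output_path, "w", encoding="utf-8") as output_file:
        for page_text in page_texts:
            output_file.write(page_text)
            output_file.write(u"\n")


# Hypothetical stand-ins for the strings returned by page.extractText():
# an Arabic word (kaf, ta, alif, ba) and a line containing u'\u2122',
# the character named in the traceback.
sample_pages = [u"\u0643\u062a\u0627\u0628", u"trade\u2122 mark"]

output_path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_pages_as_utf8(sample_pages, output_path)

# Read the raw bytes back to confirm that the file really is UTF-8.
with open(output_path, "rb") as written_file:
    raw_bytes = written_file.read()
```

If this runs cleanly, raw_bytes.decode("utf-8") should give back the original text, which would point to the extra decode() call rather than the file handling as the source of the error.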
Cory Shay