Goal
I want to convert a PDF file that contains some Arabic text into a UTF-8 text file in Python, using pyPdf.
Code
What I have tried:
import pyPdf
import codecs

input_filepath = "hans_wehr_searchable_pdf.pdf"  # PDF file path
output_filepath = "output.txt"  # output text file path
output_file = open(output_filepath, "wb")  # open output file
pdf = pyPdf.PdfFileReader(open(input_filepath, "rb"))  # read PDF
for page in pdf.pages:  # loop through pages
    page_text = page.extractText()  # get text from page
    page_text = page_text.decode(encoding='utf-8')  # decode
    output_file.write(page_text)  # write to file
output_file.close()  # close
Error
However, I receive this error:
Traceback (most recent call last):
  File "pdf2txt.py", line 9, in <module>
    page_text = page_text.decode(encoding='windows-1256')#decode
  File "/usr/lib/python2.7/encodings/cp1256.py", line 15, in decode
    return codecs.charmap_decode(input,errors,decoding_table)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2122' in position 98: ordinal not in range(128)
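A possible reading of the traceback, offered as a sketch and not a confirmed diagnosis: the UnicodeEncodeError raised from inside decode() suggests that extractText() already returns a unicode object, so Python 2 first tries to encode it implicitly with the ascii codec before decoding, and that step fails on u'\u2122'. Under that assumption the text needs to be encoded on the way out, not decoded. The helper name write_pages_utf8, the sample strings, and the file name output_demo.txt below are hypothetical; in the real script the strings would come from page.extractText().

```python
# -*- coding: utf-8 -*-
import io


def write_pages_utf8(page_texts, output_path):
    """Write an iterable of unicode strings to output_path, encoded as UTF-8.

    Hypothetical helper: page_texts stands in for the values that
    page.extractText() would return for each page of the PDF.
    """
    # io.open in text mode encodes unicode to UTF-8 while writing,
    # so no manual .decode() or .encode() call is needed.
    with io.open(output_path, "w", encoding="utf-8") as output_file:
        for page_text in page_texts:
            output_file.write(page_text)
            output_file.write(u"\n")  # separate pages with a newline


# Sample data (assumed, not taken from the PDF): Arabic text plus the
# trademark sign u'\u2122' that appears in the traceback.
sample_pages = [u"\u0643\u062a\u0627\u0628 \u2122", u"page two"]
write_pages_utf8(sample_pages, "output_demo.txt")
```

With pyPdf, the call would presumably look like write_pages_utf8((page.extractText() for page in pdf.pages), output_filepath). Whether the extracted Arabic text itself is correct depends on how the PDF encodes its fonts, which this sketch does not address.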