I have a problem with my code. I've tried everything and nothing has worked, so I thought I'd come to this community to look for answers.
def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""
The parse_html function returns a dictionary containing the contents of some of the fields in our index schema.
def pdftotext(pdf):
    """This code is very long, so I am only posting the part where the
    error occurs."""
    data = parse_html(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
    with open(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data1', basename + '.txt'), 'w') as outfile:
        outfile.write(data['text'])
    return data
There is more code after outfile.write, and that part is fine. I am trying to plug the parse_html function into the pdftotext function and then write the contents of the text field to a .txt file, and I get this error:
<ipython-input-7-dc9e4ae8fd27> in pdftotext(pdf)
37 data = parse_html(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
38 with open(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data1', basename + '.txt'), 'w') as outfile:
---> 39 outfile.write(data ['text']) <----------- this is the error
40
41 os.remove(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 108: ordinal not in range(128)
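
One possible way around this, sketched under assumptions: the u'\xfc' in the message suggests Python 2, and that data['text'] is a unicode string which the plain open() file object tries to encode as ASCII when writing. Opening the output file with io.open and an explicit encoding lets the file object encode the text as UTF-8 instead. The helper name write_text, the sample dictionary and the temporary path below are hypothetical stand-ins for illustration, not part of the original code.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def write_text(path, text):
    """Write a unicode string to path, encoded as UTF-8.

    Hypothetical helper: io.open accepts an encoding argument on both
    Python 2 and Python 3, unlike the built-in open() on Python 2.
    """
    with io.open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(text)


# Stand-in for the dictionary returned by parse_html (assumed shape).
data = {u'text': u'M\xfcnchen'}

# A temporary directory is used here instead of the real output folder.
out_path = os.path.join(tempfile.mkdtemp(), u'sample.txt')
write_text(out_path, data[u'text'])

# Read the file back with the same encoding to confirm the round trip.
with io.open(out_path, encoding='utf-8') as infile:
    roundtrip = infile.read()
```

In pdftotext this would mean replacing open(..., 'w') with io.open(..., 'w', encoding='utf-8') and leaving outfile.write(data['text']) as it is. If data['text'] turns out to be a byte string rather than unicode, it would need to be decoded first, so it is worth checking its type before applying this.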