I have a problem with my code. I've tried everything and still nothing works, so I thought I'd come to this community for answers.

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""

The parse_html function returns a dictionary containing the contents of some of the fields in our index schema.

def pdftotext(pdf):
    """This code is very long, so I am going to post only the part where
    the error occurs."""

    data = parse_html(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
    with open(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data1', basename + '.txt'), 'w') as outfile:
        outfile.write(data ['text'])
    return data

There is more code after outfile.write, and that part is fine. I am trying to plug the parse_html function into the pdftotext function and then write the contents of the text field to a .txt file, but I get this error:

   <ipython-input-7-dc9e4ae8fd27> in pdftotext(pdf)
 37     data = parse_html(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
 38     with open(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data1', basename + '.txt'), 'w') as outfile:
 ---> 39         outfile.write(data ['text'])    <----------- this is the error
 40 
 41         os.remove(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data',  basename + '.html'))

 UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 108: ordinal not in range(128)
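
The u'\xfc' in the message suggests this is Python 2 and that data['text'] is a unicode string. In that case the built-in open() in 'w' mode expects byte strings, so outfile.write implicitly encodes the text with the ASCII codec and fails on the first non-ASCII character (here "ü"). A minimal sketch of one possible fix, assuming UTF-8 is an acceptable output encoding: open the file with io.open and an explicit encoding. The helper write_text and the sample dictionary below are hypothetical stand-ins for illustration, not part of the original code.

```python
import io
import os
import tempfile


def write_text(path, text):
    """Write unicode text to path as UTF-8 (hypothetical helper)."""
    # io.open accepts an explicit encoding on both Python 2 and 3,
    # so the text is never implicitly converted with the ASCII codec.
    with io.open(path, 'w', encoding='utf-8') as outfile:
        outfile.write(text)


# Hypothetical stand-in for the dictionary returned by parse_html.
data = {'text': u'Gr\xfc\xdfe aus M\xfcnchen'}

# A temporary directory is used here instead of the real data1 folder.
out_path = os.path.join(tempfile.mkdtemp(), 'example.txt')
write_text(out_path, data['text'])

# Read the file back with the same encoding to confirm the round trip.
with io.open(out_path, encoding='utf-8') as infile:
    round_trip = infile.read()
```

In pdftotext this would amount to replacing open(..., 'w') with io.open(..., 'w', encoding='utf-8'). If data['text'] turns out to be a byte string rather than unicode, io.open will raise a TypeError on Python 2, and the text would need to be decoded first.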
Martijn Pieters
yobra89