Python ASCII vs Unicode (UTF-8)

Question

I have a problem with my code. I've tried everything and nothing has worked, so I thought I'd ask this community for help.

def parse_html(filename):
    """Extract the Author, Title and Text from a HTML file
    which was produced by pdftotext with the option -htmlmeta."""

The parse_html function returns a dictionary containing the contents of some of the fields in our index schema.

def pdftotext(pdf):
    """The full function is very long, so I am posting only the part
    where the error occurs."""

    data = parse_html(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
    with open(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data1', basename + '.txt'), 'w') as outfile:
        outfile.write(data ['text'])
    return data

There is more code after outfile.write, and that part is fine. I am trying to plug the parse_html function into the pdftotext function and then write the contents of the text field to a .txt file, but I get this error:

   <ipython-input-7-dc9e4ae8fd27> in pdftotext(pdf)
 37     data = parse_html(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data', basename + '.html'))
 38     with open(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data1', basename + '.txt'), 'w') as outfile:
 ---> 39         outfile.write(data ['text'])    <----------- this is the error
 40 
 41         os.remove(os.path.join(u'/home/brianyobra/Desktop/brian/a.i builds/a.i dev/NLP OUTCOME/data',  basename + '.html'))

 UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 108: ordinal not in range(128)

`import io; with io.open(file_path, 'w', encoding='utf-8') as outfile:` — Burhan Khalid, Oct 12 '17 at 07:06
You are writing Unicode data to a file object that only takes encoded data. Encode explicitly or use a file object that handles encoding for you. — Martijn Pieters, Oct 12 '17 at 07:09
Thanks, Burhan, but when I run it I don't get any output. Any ideas, or should I send you the whole code? — yobra89, Oct 12 '17 at 07:19
Thanks, Martijn. Could you kindly elaborate on your answer? It's been a long time since I dealt with encoding in Python. — yobra89, Oct 12 '17 at 07:23
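
The two options mentioned in the comments (encode explicitly, or use a file object that handles the encoding) could look roughly like the sketch below. It is only an illustration under some assumptions: the `text` value is a hypothetical stand-in for `data['text']`, the temporary directory and file names are placeholders for the real output paths, and UTF-8 is assumed to be the encoding wanted for the .txt file.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical stand-in for data['text']; u'\xfc' is the character
# named in the UnicodeEncodeError above.
text = u'M\xfcnchen'

# Placeholder output directory (the question uses a fixed path instead).
out_dir = tempfile.mkdtemp()

# Option 1: encode the Unicode text explicitly and write the bytes.
# Binary mode is used so the file object receives already-encoded data.
explicit_path = os.path.join(out_dir, 'explicit.txt')
with open(explicit_path, 'wb') as outfile:
    outfile.write(text.encode('utf-8'))

# Option 2: let the file object do the encoding. io.open accepts an
# encoding argument and expects Unicode text in write().
io_path = os.path.join(out_dir, 'io_open.txt')
with io.open(io_path, 'w', encoding='utf-8') as outfile:
    outfile.write(text)
```

Both variants should leave the same UTF-8 bytes on disk. Neither prints anything to the console; the result is in the written file.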