I have this code:

    printinfo = title + "\t" + old_vendor_id + "\t" + apple_id + '\n'
    # Write file
    f.write (printinfo + '\n')

But I get this error when running it:

    f.write(printinfo + '\n')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 7: ordinal not in range(128)

It's having trouble writing out this:

Identité secrète (Abduction) [VF]

Any ideas, please? I'm not sure how to fix this.

Cheers.

UPDATE: This is the bulk of my code, so you can see what I am doing:

def runLookupEdit(self, event):
    newpath1 = pathindir + "/"
    errorFileOut = newpath1 + "REPORT.csv"
    f = open(errorFileOut, 'w')

    global old_vendor_id

    for old_vendor_id in vendorIdsIn.splitlines():
        writeErrorFile = 0
        from lxml import etree
        parser = etree.XMLParser(remove_blank_text=True) # makes pretty print work

        path1 = os.path.join(pathindir, old_vendor_id)
        path2 = path1 + ".itmsp"
        path3 = os.path.join(path2, 'metadata.xml')

        # Open and parse the xml file
        cantFindError = 0
        try:
            with open(path3): pass
        except IOError:
            cantFindError = 1
            errorMessage = old_vendor_id
            self.Error(errorMessage)
            break
        tree = etree.parse(path3, parser)
        root = tree.getroot()

        for element in tree.xpath('//video/title'):
            title = element.text
            while '\n' in title:
                title = title.replace('\n', ' ')
            while '\t' in title:
                title = title.replace('\t', ' ')
            while '  ' in title:
                title = title.replace('  ', ' ')
            title = title.strip()
            element.text = title
        print title

        #########################################
        ######## REMOVE UNWANTED TAGS ########
        #########################################

        # Remove the comment tags
        comments = tree.xpath('//comment()')
        q = 1
        for c in comments:
            p = c.getparent()
            if q == 3:
                apple_id = c.text
            p.remove(c)
            q = q+1

        apple_id = apple_id.split(':',1)[1]
        apple_id = apple_id.strip()
        printinfo = title + "\t" + old_vendor_id + "\t" + apple_id

        # Write file
        # f.write (printinfo + '\n')
        f.write(printinfo.encode('utf8') + '\n')
    f.close()
speedyrazor
  • If you look at the right side of the question, you will notice a column of "Related" questions. I suggest you start by looking at them. You would also have gotten a list of possible duplicates when writing your question title. – Some programmer dude Nov 07 '13 at 10:26
  • @MartijnPieters: you are right, as usual. Comment erased. – cdarke Nov 07 '13 at 10:37

1 Answer

You need to encode Unicode explicitly before writing to a file, otherwise Python does it for you with the default ASCII codec.
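As a minimal sketch of what that implicit step amounts to, the snippet below encodes the title from the question with the ASCII codec by hand. The variable names are illustrative only; the title is written with escapes so the source stays pure ASCII.

```python
# The title from the question, as a Unicode string (\xe9 = e-acute, \xe8 = e-grave).
title = u"Identit\xe9 secr\xe8te (Abduction) [VF]"

error = None
try:
    # Roughly what happens implicitly when a unicode object is written
    # to a plain Python 2 file object: an ASCII encode.
    title.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc  # keep a reference; the 'exc' name is cleared after the block
    print(exc)

# Choosing a codec that covers all of Unicode avoids the error.
utf8_bytes = title.encode("utf8")
```

The failing index reported by the exception lines up with "position 7" in the traceback from the question, because the first accented character is the eighth character of the title.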

Pick an encoding and stick with it:

f.write(printinfo.encode('utf8') + '\n')

or use io.open() to create a file object that'll encode for you as you write to the file:

import io

f = io.open(filename, 'w', encoding='utf8')
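A slightly fuller sketch of both options, written to run on Python 2 and Python 3 alike. The temporary directory, the file names, and the two ID values are placeholders invented for illustration, not values from the question.

```python
import io
import os
import tempfile

# Hypothetical row: the title from the question plus made-up IDs.
printinfo = u"Identit\xe9 secr\xe8te (Abduction) [VF]\t12345\t67890"

out_dir = tempfile.mkdtemp()
explicit_path = os.path.join(out_dir, "explicit.csv")
text_path = os.path.join(out_dir, "text.csv")

# Option 1: encode explicitly and write bytes to a binary-mode file.
with io.open(explicit_path, "wb") as f:
    f.write((printinfo + u"\n").encode("utf8"))

# Option 2: let the file object encode on write.
# newline="\n" keeps line endings identical across platforms.
with io.open(text_path, "w", encoding="utf8", newline="\n") as f:
    f.write(printinfo + u"\n")

# Read both files back as raw bytes to compare what was stored.
with io.open(explicit_path, "rb") as f:
    raw_explicit = f.read()
with io.open(text_path, "rb") as f:
    raw_text = f.read()
```

Both files should contain the same UTF-8 bytes; the difference is only where the encoding step happens. With the second form, every value passed to `f.write()` has to be a Unicode string.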

You may want to read:

before continuing.

Martijn Pieters
  • Using f.write(printinfo.encode('utf8') + '\n') works but creates odd characters Identit√© secr√®te (Abduction) [VF] which should be accented Identité secrète (Abduction) [VF] – speedyrazor Nov 07 '13 at 10:46
  • @speedyrazor: please *do* read the links I provided. You are opening a UTF-8 file with something that displays the bytes as a different encoding instead. Pick the right encoding for your application. – Martijn Pieters Nov 07 '13 at 10:51
  • @Martijn Pieters: I have had a read through, but don't really understand. If I have "Identité secrète" in my XML file I am reading, I pick lines out and write them to a file, but that line comes out as "Identit√© secr√®te". Sorry to ask, but what code would sort this out please? – speedyrazor Nov 07 '13 at 11:12
  • @speedyrazor: Your XML file uses a codec too. It either uses UTF-8 or has a different codec specified on the first line of the XML file. The XML parser then decodes that data to a Unicode value. When writing out the values to a file, you need to pick a codec again to write bytes. I picked UTF-8 for you because that codec can encode all of unicode, but whatever you used to view the resulting file used a **different** codec to interpret the bytes. The `é` character is unicode codepoint U+00E9. UTF-8 encodes that to two bytes, hex C3 and A9. Misinterpreting those two bytes gives you `√©`. – Martijn Pieters Nov 07 '13 at 12:15
  • @speedyrazor: without knowing how you are *reading* the produced file again, I cannot help you further. – Martijn Pieters Nov 07 '13 at 12:48
  • @speedyrazor: I didn't ask for how you produced the Unicode values, I asked how you were reading the file written by the script. – Martijn Pieters Nov 07 '13 at 13:30
  • Sorry, I have updated my original question to include the bulk of the code. – speedyrazor Nov 07 '13 at 14:25
  • @speedyrazor: You still haven't told me how you are opening `REPORT.csv` after the script completes. You are reading that file somewhere and **there** you see `Identit√© secr√®te`. It is **that program** that is not interpreting the written data correctly. Your Python code is working as intended. – Martijn Pieters Nov 07 '13 at 14:31
  • Aahhh, the penny finally dropped. Open Office was incorrectly displaying those characters. – speedyrazor Nov 07 '13 at 14:52
  • @speedyrazor: exactly. Excel is often worse for reading CSV encoded with anything other than the current locale setting. Imagine producing CSVs for a customer with a Norwegian codepage, who then share the same file with someone in Eastern Europe using a different Windows codepage. – Martijn Pieters Nov 07 '13 at 14:55
  • I have the error message: UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' But this answer did not resolve the error – Lawrence DeSouza Oct 26 '14 at 00:57
  • @LawrenceDeSouza: there are lots of ways you can trigger that error message. The solution in this answer is addressed to the specific way the OP triggered it. E.g. if you are trying to write a Unicode string object to a Python 2 file object, you'll have to encode first. – Martijn Pieters Oct 26 '14 at 01:06
  • @LawrenceDeSouza: I'm sorry that it did not solve *your* problem, but you need to first determine if your problem is actually the same. – Martijn Pieters Oct 26 '14 at 01:08
  • "You need to encode Unicode explicitly before writing to a file, otherwise Python does it for you with the default ASCII codec." that's the important bit I needed. – jmoz Jul 21 '15 at 14:24
  • This solved my problem: `# encoding=utf8`, followed by `import sys; reload(sys); sys.setdefaultencoding('utf8')` – Joseph Daudi Jun 23 '18 at 17:53
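
The garbled output discussed in the comments above can be reproduced with a short sketch. It assumes the viewer decoded the bytes as Mac OS Roman (Python's `mac_roman` codec); the thread never names the codec actually used, so that choice is a guess that happens to produce the same characters.

```python
# Part of the title from the question (\xe9 = e-acute, \xe8 = e-grave).
original = u"Identit\xe9 secr\xe8te"

# What the script writes: e-acute becomes the two bytes C3 A9 in UTF-8.
utf8_bytes = original.encode("utf8")

# Assumption: the viewer interprets those bytes as Mac OS Roman instead.
misread = utf8_bytes.decode("mac_roman")

# Decoding with the codec that was used for writing restores the text.
reread = utf8_bytes.decode("utf8")

print(repr(misread))
print(repr(reread))
```

The bytes on disk are the same in both cases; only the codec chosen by the reading program differs, which is why the fix belongs in the viewer's import settings rather than in the script.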