
I have done some research and seen solutions but none have worked for me.

Python - 'ascii' codec can't decode byte

That didn't work for me. I know that 0xe9 is the é character, but I still can't figure out how to get this working. Here is my code:

output_lines = ['<menu>', '<day name="monday">', '<meal name="BREAKFAST">', '<counter name="Entreé">', '<dish>', '<name icon1="Vegan" icon2="Mindful Item">', 'Cream of Wheat (Farina)','</name>', '</dish>', '</counter >', '</meal >', '</day >', '</menu >']
output_string = '\n'.join([line.encode("utf-8") for line in output_lines])

This gives me the error: 'ascii' codec can't decode byte 0xe9

I have tried decoding, and I have tried replacing the "é", but I can't get either approach to work.

iqueqiorio
  • Your code sample is invalid and won't reproduce the issue; `output_lines` is empty so your loop won't do anything. Your error indicates you have a **decoding** error while encoding, this usually indicates you are trying to encode data that is **already** encoded. – Martijn Pieters Mar 09 '15 at 17:00
  • @MartijnPieters sorry I didn't show it was full in my sample code but it is filled. I will add that to the question – iqueqiorio Mar 09 '15 at 17:02
  • this is still not your actual `output_lines` ... surely ... can you `print output_lines` right before you try to create `output_string` – Joran Beasley Mar 09 '15 at 17:05
  • @JoranBeasley yes, but `output_lines` is much longer so I shortened it – iqueqiorio Mar 09 '15 at 17:06
  • Your data is **already encoded**, why do you feel the need to encode again? – Martijn Pieters Mar 09 '15 at 17:06
  • @iqueqiorio unfortunately you did more than shorten it ... – Joran Beasley Mar 09 '15 at 17:07
  • @JoranBeasley: it reproduces the problem, I don't see why more is needed? – Martijn Pieters Mar 09 '15 at 17:10
  • @MartijnPieters cause surely it is an encoded acute e ... and we should see the escape code not the encoded character... and it clearly doesn't reproduce the issue per comments on your solution – Joran Beasley Mar 09 '15 at 17:15

4 Answers


You are trying to encode bytestrings:

>>> '<counter name="Entreé">'.encode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 20: ordinal not in range(128)

Python is trying to be helpful here. You can only encode a Unicode string to bytes, so when you call encode on a bytestring, Python 2 first implicitly decodes it using the default encoding (ASCII), and that implicit decode is what fails.

The solution is to not encode data that is already encoded, or, if the data was encoded with a different codec than the one you need, to first decode it using a suitable codec before encoding again.
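The implicit step can be made visible by performing it by hand. This is only a sketch: it assumes the data is UTF-8 (as in the traceback above), spells the é as an escape so it does not depend on the source file encoding, and runs the same way on Python 2 and Python 3.

```python
# Build the bytes the way a Python 2 byte string would hold them (UTF-8 assumed).
raw = u'<counter name="Entre\u00e9">'.encode('utf-8')

try:
    # This is the decode Python 2 attempts behind the scenes when you call
    # .encode() on a byte string: it uses the ASCII default codec.
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The message names the decode step, even though the caller asked to encode.
message = str(error)
```

The failing byte is the first byte of the encoded é, which is why the error mentions decoding although the code only called encode.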

If you have a mix of unicode and bytestring values, decode just the bytestrings or encode just the unicode values; try to avoid mixing the types. The following decodes byte strings to unicode first:

def ensure_unicode(v):
    if isinstance(v, str):
        v = v.decode('utf8')
    return unicode(v)  # convert anything not a string to unicode too

output_string = u'\n'.join([ensure_unicode(line) for line in output_lines])
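To find out whether a list really mixes the two types, one option is to count them. The sketch below is an illustration, not part of the answer's code: the helper name `count_string_types` and the sample list are hypothetical, and it relies on `bytes` being an alias for `str` in Python 2.6 and later.

```python
def count_string_types(items):
    """Return (byte_count, text_count) for a list of string-like items."""
    # On Python 2, bytes is str, so this picks out the byte strings;
    # everything else in the list is counted as text (unicode).
    byte_count = sum(1 for item in items if isinstance(item, bytes))
    return byte_count, len(items) - byte_count


# Hypothetical sample: two byte strings and one unicode string.
sample = [b'<menu>', u'<counter name="Entre\u00e9">', b'</menu >']
counts = count_string_types(sample)
```

If both counts are non-zero, the list is mixed and joining it will trigger an implicit conversion on Python 2.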
Martijn Pieters
  • afaik this also indicates he is using python2x ... since in 3x it no longer tries to implicitly convert things and you get a much clearer error (+1 ofc) – Joran Beasley Mar 09 '15 at 17:08
  • @JoranBeasley and Martijn when I change it to `output_string = '\n'.join([line for line in output_lines])` I still get the same error? – iqueqiorio Mar 09 '15 at 17:10
  • @iqueqiorio: do you have a *mix* of Unicode and byte strings in your list? – Martijn Pieters Mar 09 '15 at 17:11
  • @MartijnPieters I don't think so. It is a long list; is there a way to check with an if statement? – iqueqiorio Mar 09 '15 at 17:12
  • then you need to post the actual input that is causing an error ... maybe put it in a dpaste ... but as it is we cannot replicate your issue ... and you should post a full traceback ... – Joran Beasley Mar 09 '15 at 17:13
  • @iqueqiorio: that's not a link to a gist; don't worry though, I have it covered. – Martijn Pieters Mar 09 '15 at 17:15
  • @MartijnPieters thats a good solution :) (one i have had to use before ... I still think its better to have well formed input) – Joran Beasley Mar 09 '15 at 17:17
  • @MartijnPieters I got the same error and an error on the line `v = v.decode("utf8")` – iqueqiorio Mar 09 '15 at 17:17
  • surely not `UnicodeDecodeError: ascii codec cannot ...` – Joran Beasley Mar 09 '15 at 17:18
  • I get `Unicode Decode: 'utf8' codec can't decode byte` – iqueqiorio Mar 09 '15 at 17:19
  • try `v.decode("latin1")` ... this is where its really handy to know the encoding you are using ahead of time ;P ... just wait till you get JIS encodings – Joran Beasley Mar 09 '15 at 17:20
  • @iqueqiorio: right, because you never specified what codec your data is encoded in, and I picked a common default for XML data. Where did the data come from? Do you have any more context that would let you determine the correct codec? – Martijn Pieters Mar 09 '15 at 17:20
  • @JoranBeasley: or cp1252; neither will *fail* but may not produce readable output if it is the wrong codec. – Martijn Pieters Mar 09 '15 at 17:22
  • `"\xe9".decode("utf8") == ERROR` however in latin1 it is acute e (as noted by @MartijnPieters it also works decoding with "cp1252" ... and if you pick the wrong one you will get problems) – Joran Beasley Mar 09 '15 at 17:22
  • @iqueqiorio: then the web server can have provided you with the codec, or the XML format itself could have included the codec in the metadata. – Martijn Pieters Mar 09 '15 at 17:24
  • @MartijnPieters okay where could I find that info, of what codec they use? – iqueqiorio Mar 09 '15 at 17:26
  • @iqueqiorio: depends; see [retrieve links from web page using python and BeautifulSoup](http://stackoverflow.com/a/22583436) for sample code that retrieves the codec if available in the headers. Note that BeautifulSoup will find codec info in the document itself as needed. – Martijn Pieters Mar 09 '15 at 17:28
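When the real codec cannot be determined from headers or metadata, a common fallback is to try a short list of candidates in order. The following is a rough sketch of that idea, not a recommendation from the thread: the function name `decode_with_fallback` and the candidate order are assumptions. A decode that succeeds is not proof the codec was the right one, which is the caveat raised in the comments above.

```python
def decode_with_fallback(raw, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Decode raw bytes with the first candidate codec that does not fail."""
    for name in candidates:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            # This codec rejected the bytes; move on to the next candidate.
            continue
    raise ValueError('none of the candidate codecs could decode the data')


# A lone 0xe9 byte is invalid UTF-8, so the cp1252 candidate is used.
from_single_byte = decode_with_fallback(b'Entre\xe9')
# The two-byte sequence 0xc3 0xa9 is valid UTF-8 and decodes on the first try.
from_utf8 = decode_with_fallback(b'Entre\xc3\xa9')
```

Strict codecs such as UTF-8 go first because they reject most data that is not theirs, while latin-1 accepts any byte and therefore only makes sense as the last resort.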

A simple example of the problem is:

>>> '\xe9'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)

\xe9 isn't an ASCII character, which means that your string is already encoded. You need to decode it into Python's unicode type and then encode it again in the serialization format you want.

Since I don't know where your string came from, I just peeked at the python codecs, picked something from Western Europe and gave it a go:

>>> '\xe9'.decode('cp1252')
u'\xe9'
>>> u'\xe9'.encode('utf-8')
'\xc3\xa9'

You'll have the best luck if you know exactly which encoding the file came from.
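Applied to a whole list of lines, the same decode-then-encode round trip might look like the sketch below. It assumes the input lines are cp1252 byte strings, which is a guess just as it is above; the list `raw_lines` is a made-up, shortened stand-in for the real data. The code runs on Python 2 and Python 3.

```python
# Hypothetical input: byte strings in cp1252 (the source codec is an assumption).
raw_lines = [b'<counter name="Entre\xe9">', b'<dish>']

# Step 1: decode every line to unicode so all text has a single type.
text = u'\n'.join(line.decode('cp1252') for line in raw_lines)

# Step 2: encode once, at the end, into the codec you want to output.
output_string = text.encode('utf-8')
```

Decoding first and encoding only at the very end keeps the implicit ASCII conversion out of the picture entirely.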

tdelaney

encode = turn a unicode string into a bytestring

decode = turn a bytestring into unicode

Since you already have a bytestring, you need to decode it to get a unicode instance (assuming that is actually what you are trying to do):

output_string = '\n'.join(output_lines)
print output_string.decode("latin1")  # decode() returns a unicode object
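The two directions can be checked with a small round trip. This is an illustrative sketch only; latin-1 is used because it is the codec in the snippet above, not because the actual data is known to be latin-1. It runs on Python 2 and Python 3.

```python
text = u'Entre\u00e9'            # a unicode string ending in the é character

raw = text.encode('latin-1')      # encode: unicode -> byte string
back = raw.decode('latin-1')      # decode: byte string -> unicode

round_trip_ok = (back == text)    # the round trip gives back the original text
```

In latin-1 the é is the single byte 0xe9, which matches the byte reported in the question's error message.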
Joran Beasley

What to do here depends on what you want to do with your lines. If you just want to print them to the console, you don't need to encode anything yourself: your string is a bytestring, not unicode, and consoles normally use UTF-8, so you can print it as is:

>>> output_string = '\n'.join(output_lines)
>>> print output_string
<menu>
<day name="monday">
<meal name="BREAKFAST">
<counter name="Entreé">
<dish>
<name icon1="Vegan" icon2="Mindful Item">
Cream of Wheat (Farina)
</name>
</dish>
</counter >
</meal >
</day >
</menu > 

But if you want to write to file you can use codecs module:

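import codecs
with codecs.open('out_file', 'w', encoding='utf8') as f:
    # codecs files expect unicode; 'latin1' is an assumed source codec
    f.write(output_string.decode('latin1'))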
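An alternative that behaves the same on Python 2 and Python 3 is `io.open`, which also encodes unicode text on the way out. This is a sketch under assumptions: the temporary directory and the file name `out_file.xml` are made up for the example, and the text is taken to be already decoded to unicode.

```python
import io
import os
import tempfile

# Text that has already been decoded to unicode.
text = u'<counter name="Entre\u00e9">'

# Hypothetical output location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'out_file.xml')

# io.open encodes the unicode text to UTF-8 as it is written.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, 'rb') as f:
    written = f.read()
```

Writing unicode and letting the file object do the encoding means there is exactly one encode step, at the boundary where the data leaves the program.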
Mazdak