Python UTF-8 can't decode byte on 32-bit machine

Question

It works fine on 64-bit machines, but for some reason it does not work on Python 2.4.3 on a 32-bit instance.

I get the error

'utf8' codec can't decode bytes in position 76-79: invalid data

for the code

try:
    str(sourceresult.sourcename).encode('utf8', 'replace')
except:
    raise Exception(repr(sourceresult.sourcename))

it returns 'kazamidori blog\xf9'

I have modified my site.py file to make UTF-8 the default encoding, but it still doesn't seem to be working.

I strongly doubt this is a 32/64 bit issue. What character set are you encoding this *from*? — Pekka, Apr 01 '10 at 18:29
Well, it should be in UTF-8 because it's data being pulled from a MySQL table that has a default encoding of UTF-8, which is why I am a bit confused. Maybe mysqldb encodes the data in its own way? — JiminyCricket, Apr 01 '10 at 18:42
Turns out that the default UTF8 connection was turned off because of some incompatibility between the 32-bit EC2 server and mysqldb. I believe I am converting from ASCII then. I just removed the encode('utf8','replace') call and am getting a different error: 'utf8' codec can't decode byte 0xf9 in position 15: unexpected end of data — JiminyCricket, Apr 01 '10 at 19:04
I think you need to take a step back, think through the encode/decode process, and make sure you know what encoding your strings have at each step of the way. You need to be absolutely sure of a) whether sourceresult.sourcename is unicode and b) if not, what encoding it has, before anyone can help you out. — DNS, Apr 01 '10 at 19:15

score 7 · Accepted Answer · edited May 23 '17 at 11:59

We need the following, and we need the exact output:

type(sourceresult.sourcename) # I suspect it's already a UTF-8 encoded string

repr(sourceresult.sourcename)

Like I said, I'm almost certain that your sourceresult.sourcename is already a UTF-8 encoded string.

Perhaps this might help a little.

EDIT: it seems your sourceresult.sourcename is encoded as cp1252. I don't know what mystring (that you reference in a comment) is. So, to get a UTF-8 encoded string, you need to do:

source_as_UTF8 = sourceresult.sourcename.decode("cp1252").encode("utf-8")

However, the string being cp1252-encoded is not consistent with the error message you supplied.
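A minimal sketch of the conversion above, written for Python 3 as an assumption (the question uses Python 2.4, where the same decode/encode calls apply to a plain str). The byte string is the one from the question's repr; the names raw, text and source_as_utf8 are illustrative only.

```python
# Bytes as shown by repr() in the question; 0xF9 is "ù" in cp1252.
raw = b"kazamidori blog\xf9"

# Decoding as UTF-8 fails because 0xF9 does not form a valid UTF-8 sequence here.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError as exc:
    utf8_ok = False
    bad_position = exc.start  # index of the offending byte

# Decode from the actual source encoding, then encode to UTF-8.
text = raw.decode("cp1252")
source_as_utf8 = text.encode("utf-8")
```

The offending byte sits at index 15, which matches the position reported in the second error message quoted in the question's comments.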

answered Apr 01 '10 at 19:00

tzot

this is the repr 'kazamidori blog\\xf9' this is the type is there any way to find out what type of string? – JiminyCricket Apr 01 '10 at 19:16
Assuming that it was already UTF8, I tried mystring.decode('utf8','replace') but that only returned the first character of the string. – JiminyCricket Apr 01 '10 at 19:18
I was able to fix it by doing (sourceresult.sourcename).decode('cp1252').encode('utf8'). How were you able to tell that it was cp1252? – JiminyCricket Apr 01 '10 at 19:36
Because it's the "Windows Western" encoding, and thus the safest bet :) It also helped that the resulting "kazamidori blogù" has hits in Google. BTW, whenever you find that an answer is the one that solves your problem, you should click the checkmark (✓) under the answer's vote count. – tzot Apr 01 '10 at 21:08
+1 Well spotted, ΤΖΩΤΖΙΟΥ. A wise man once said "If the encoding of some data is stated to be unknown or ISO-8859-1, it is in fact cp1252". – John Machin Apr 01 '10 at 21:54
Thanks, good to know. I wanted to vote your post up, but I don't have enough reputation yet =( – JiminyCricket Apr 01 '10 at 22:02

score 0 · Answer 2 · answered Apr 01 '10 at 18:39

"Invalid data" usually means that the incoming data contained byte sequences that are not valid in the character set it is being decoded as.

This is often caused by, at some point, some data being encoded in a character set different than UTF-8.

For example, the file a string is stored in may not have been converted to UTF-8 when you made UTF-8 the standard character set. (In Windows, you can usually specify a file's encoding in the "Save as..." dialog of your text editor.)

Or the data may come from a database that uses a different character set in the tables, the connection, or both.

Check out where the data comes from, and what encodings are set along the way.
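One way to start checking is to see which encodings accept the raw bytes at all. The sketch below assumes Python 3; the helper name decodable_with and the candidate list are hypothetical choices for illustration, not part of any library.

```python
def decodable_with(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the candidate encodings that decode raw without raising."""
    accepted = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding rejects the bytes
        accepted.append(name)
    return accepted


# Byte string taken from the question's repr output.
sample = b"kazamidori blog\xf9"
working = decodable_with(sample)
```

A successful decode only shows that the bytes are legal in that encoding, not that it is the right one: latin-1 accepts every byte value, so the result still has to be checked by eye against the expected text.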

score 0 · Answer 3 · answered Apr 01 '10 at 19:04

I think the problem is with your use of the str() function. Keep in mind that in Python 2, str() returns narrow strings, i.e. byte strings. If the input, sourceresult.sourcename, is unicode, then Python automatically encodes it in order to return a narrow string. To do this it uses the default encoding reported by sys.getdefaultencoding(), which is ASCII unless it has been changed (for example in site.py).

So you're getting the error because it doesn't make sense to call encode on a string that is already encoded. In Python 2, calling encode() on a byte string first decodes it implicitly with the default encoding, which is why the failure is reported as a decode error. If you get rid of the str(), it should work.
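The implicit step can be made visible with a small emulation. This is a sketch in Python 3, where bytes has no encode() method, so the function py2_style_encode is a hypothetical stand-in for what Python 2 does behind the scenes, not real Python 2 code.

```python
def py2_style_encode(byte_string, target="utf-8", default_encoding="ascii"):
    """Emulate Python 2's byte_string.encode(target) in two explicit steps."""
    # Step 1 (implicit in Python 2): bytes -> unicode via the default encoding.
    text = byte_string.decode(default_encoding)
    # Step 2 (the call that was actually written): unicode -> bytes.
    return text.encode(target)


# With the default encoding switched to UTF-8, as the question describes,
# the failure happens in step 1, before any encoding takes place.
try:
    py2_style_encode(b"kazamidori blog\xf9", default_encoding="utf-8")
    error_kind = None
except UnicodeDecodeError as exc:
    error_kind = type(exc).__name__
    error_position = exc.start
```

Pure ASCII input passes through both steps unchanged, which is why code like this can appear to work until the first non-ASCII byte shows up.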

DNS

Yeah, my answer is only applicable if, as you originally said, the source string is unicode. If it, as it now appears, isn't, then you'll need to figure out what the database is encoding it to before I can suggest anything. – DNS Apr 01 '10 at 19:21
Yup, sorry for the confusion. I thought it was unicode. The main problem here is that the data isn't in a standard encoding, I guess. I was able to fix it by doing (sourceresult.sourcename).decode('cp1252').encode('utf8'). This is based on ΤΖΩΤΖΙΟΥ saying that it was cp1252; I'm curious to know how he found that out. Will comment on his post. – JiminyCricket Apr 01 '10 at 19:35

score 0 · Answer 4 · answered Jan 04 '11 at 18:11

Make sure you don't have an odd number of bytes in your varchar field; I had a varchar(255) that blew up when someone entered a long string in Arabic. I then got the "unexpected end of data" error (as one might expect...!)
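A small sketch of how truncation produces that error, assuming Python 3 and a byte-level cut; the Arabic sample word and the cut position are illustrative, not taken from the original data.

```python
# A five-letter Arabic word; each letter takes 2 bytes in UTF-8.
text = "\u0645\u0631\u062d\u0628\u0627"
encoded = text.encode("utf-8")

# Cut after 5 bytes, i.e. in the middle of the third character.
truncated = encoded[:5]

try:
    truncated.decode("utf-8")
    reason = None
except UnicodeDecodeError as exc:
    reason = exc.reason

# Dropping the incomplete tail keeps only the complete characters.
recovered = truncated.decode("utf-8", errors="ignore")
```

Here the decoder reports the dangling lead byte at the end of the data, and errors="ignore" silently drops it, so it is usually better to fix the column size or truncate on character boundaries than to rely on that.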

Johnny O
