Convert Python string between iso8859_1 and utf-8

Question

I am trying to do the same thing in Python as the Java code below.

String decoded = new String("ä¸".getBytes("ISO8859_1"), "UTF-8");
System.out.println(decoded);

The output is a Chinese String "中".

In Python I tried combinations of encode, decode, and bytearray, but I always got an unreadable string. I think my problem is that I don't really understand how the Java and Python encoding mechanisms work, and I could not find a solution in the existing answers.

#coding=utf-8

def p(s):
    print s + ' --  ' + str(type(s))

ch1 = 'ä¸-'
p(ch1)

chu1 = ch1.decode('ISO8859_1')
p(chu1.encode('utf-8'))

utf_8 = bytearray(chu1, 'utf-8')
p(utf_8)

p(utf_8.decode('utf-8').encode('utf-8'))

#utfstr = utf_8.decode('utf-8').decode('utf-8')
#p(utfstr)

p(ch1.decode('iso-8859-1').encode('utf8'))

ä¸- --  <type 'str'>
Ã¤Â¸Â- --  <type 'str'>
Ã¤Â¸Â- --  <type 'bytearray'>
Ã¤Â¸Â- --  <type 'str'>
Ã¤Â¸Â- --  <type 'str'>

Daniel Roseman's answer is really close. Thank you. But when it comes to my real case:

    ch = 'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤'
    print ch.decode('utf-8').encode('iso-8859-1')

I got

    Traceback (most recent call last):
      File "", line 1, in
      File "/apps/Python/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 19: invalid start byte

Java code:

    String decoded = new String("masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤".getBytes("ISO8859_1"), "UTF-8");
    System.out.println(decoded);

The output is masanori harigae のパーソナル会�-�室

Well, already voted to close for lack of repro, but this is also a 100% duplicate of [Python: Converting from ISO-8859-1/latin1 to UTF-8](http://stackoverflow.com/q/6539881/364696). Can someone else close on that basis? – ShadowRanger, Nov 30 '16 at 13:55
@furas I tried your solution and got the error message below: Traceback (most recent call last): File "test.py", line 24, in print( "ä¸-".encode('iso-8859-1').decode('utf8') ) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) – fanwu72, Nov 30 '16 at 13:59
It may be a problem with the console/terminal, which doesn't tell Python what encoding it uses, so Python's print falls back to `encode('ascii')`. You have to call `encode(..)` yourself before you print it. – furas, Nov 30 '16 at 15:09
Java may treat your string as Unicode, so in Python you should use the `u` prefix, `u"masanori ..."`, to get the same situation. Then you can do `print u'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤'.encode('iso-8859-1')` – furas, Nov 30 '16 at 15:37

score 1 · Answer 1 · answered Nov 30 '16 at 14:27

You are doing this the wrong way round. You have a bytestring that is wrongly encoded as utf-8 and you want it to be interpreted as iso-8859-1:

>>> ch = "ä¸"
>>> print ch.decode('utf-8').encode('iso-8859-1')
中
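
For readers on Python 3, where `str` is already Unicode, the same repair runs as encode first, then decode: encoding the garbled text as ISO-8859-1 recovers the raw bytes, and decoding those bytes as UTF-8 gives the intended character. A minimal sketch, assuming Python 3; the helper name `fix_mojibake` is hypothetical and not part of the original answer:

```python
def fix_mojibake(text):
    """Reinterpret text whose UTF-8 bytes were mis-decoded as ISO-8859-1."""
    # ISO-8859-1 maps code points 0-255 one-to-one onto bytes 0x00-0xFF,
    # so this step recovers the original byte sequence unchanged.
    raw = text.encode('iso-8859-1')
    # Those bytes were UTF-8 all along; decode them as such.
    return raw.decode('utf-8')


# 'ä', '¸' and a soft hyphen: the bytes E4 B8 AD read as ISO-8859-1.
garbled = '\u00e4\u00b8\u00ad'
print(fix_mojibake(garbled))  # 中
```

This mirrors the Java line `new String(s.getBytes("ISO8859_1"), "UTF-8")` from the question.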

Daniel Roseman

That's a good answer and I did get "中" with this sample case. But when it comes to my real case, where ch = 'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤', I get UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 19: invalid start byte. When I did a similar thing in Java, it printed masanori harigae のパーソナル会�-�室. – fanwu72 Nov 30 '16 at 14:47
@fanwu72 you should put the real case in the question. – furas Nov 30 '16 at 15:12
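
Two things seem to separate the Java and Python results in the real case. In a Python 2 byte string, an escape such as `\201` is a raw byte, whereas in a Java string (and in a Python 3 `str`) it is the code point U+0081. Java's `new String(bytes, "UTF-8")` also substitutes U+FFFD for malformed input, while Python's `decode` raises by default. A sketch, assuming Python 3 and assuming the `-` in the sample stands where a soft hyphen (byte 0xAD) was lost; the variable names are illustrative:

```python
# Tail of the real-case string, 'ä¼\232è-°å®¤', written with escapes.
# In a Python 3 str literal these are code points, as in a Java String.
garbled = '\u00e4\u00bc\u009a\u00e8-\u00b0\u00e5\u00ae\u00a4'

# Recover the raw bytes: E4 BC 9A E8 2D B0 E5 AE A4.
raw = garbled.encode('iso-8859-1')

# Strict decoding fails, because E8 is not followed by a continuation byte.
strict_error = None
try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    strict_error = exc

# errors='replace' substitutes U+FFFD for each malformed byte,
# which matches the Java output shown in the question.
repaired = raw.decode('utf-8', errors='replace')
print(repaired)  # 会�-�室
```

The replacement characters mark data that was already damaged before decoding, so no choice of codec can restore the missing character from this input alone.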
