I am trying to do the same thing in Python that the Java code below does.

String decoded = new String("ä¸\u00AD".getBytes("ISO8859_1"), "UTF-8");
System.out.println(decoded);

The output is a Chinese String "中".

In Python I tried various combinations of encode, decode, and bytearray, but I always got an unreadable string. I think my problem is that I don't really understand how encoding works in Java and Python. I also could not find a solution in the existing answers.

#coding=utf-8

def p(s):
    print s + ' --  ' + str(type(s))

ch1 = 'ä¸-'
p(ch1)

chu1 = ch1.decode('ISO8859_1')
p(chu1.encode('utf-8'))

utf_8 = bytearray(chu1, 'utf-8')
p(utf_8)

p(utf_8.decode('utf-8').encode('utf-8'))

#utfstr = utf_8.decode('utf-8').decode('utf-8')
#p(utfstr)

p(ch1.decode('iso-8859-1').encode('utf8'))

Output:

ä¸- --  <type 'str'>
ä¸Â- --  <type 'str'>
ä¸Â- --  <type 'bytearray'>
ä¸Â- --  <type 'str'>
ä¸Â- --  <type 'str'>

Daniel Roseman's answer is really close. Thank you. But when it comes to my real case:

    ch = 'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤'
    print ch.decode('utf-8').encode('iso-8859-1')

I got

    Traceback (most recent call last):
      File "", line 1, in
      File "/apps/Python/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 19: invalid start byte

Java code:

    String decoded = new String("masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤".getBytes("ISO8859_1"), "UTF-8");
    System.out.println(decoded);

The output is masanori harigae のパーソナル会�-�室

fanwu72
  • `print( "中".encode('iso-8859-1').decode('utf8') )` – furas Nov 30 '16 at 13:51
  • Well, already voted to close for lack of repro, but this is also a 100% duplicate of [Python: Converting from ISO-8859-1/latin1 to UTF-8](http://stackoverflow.com/q/6539881/364696). Can someone else close on that basis? – ShadowRanger Nov 30 '16 at 13:55
  • @furas I tried your solution and got the error message below: Traceback (most recent call last): File "test.py", line 24, in print( "ä¸-".encode('iso-8859-1').decode('utf8') ) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) – fanwu72 Nov 30 '16 at 13:59
  • It may be a problem with a console/terminal that doesn't tell Python which encoding it uses, so Python's `print` falls back to `encode('ascii')`. You have to call `encode(..)` yourself before you print. – furas Nov 30 '16 at 15:09
  • Java may treat your string as Unicode, so in Python you should use the `u` prefix (`u"masanori ..."`) to get the same situation. Then you can do `print u'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤'.encode('iso-8859-1')` – furas Nov 30 '16 at 15:37
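The "byte 0x81 in position 19" error from the question can be reproduced in Python 3 by building, by hand, the bytes that a Python 2 byte-string literal would contain. This is only a sketch: it assumes the Python 2 source was saved as UTF-8 (as the `#coding=utf-8` line suggests), and the variable names are illustrative.

```python
# Python 3 sketch. In a Python 2 str literal inside a UTF-8 source file,
# 'ã' is stored as its two UTF-8 bytes (C3 A3), while the octal escape
# \201 is stored as the single raw byte 0x81.
prefix = "masanori harigae ã".encode("utf-8")  # 17 ASCII bytes + C3 A3
py2_bytes = prefix + b"\x81"                    # raw byte from the \201 escape

try:
    py2_bytes.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The lone 0x81 is not a valid UTF-8 start byte.
print(error.start, hex(py2_bytes[error.start]))  # 19 0x81

# With a Unicode string (the u prefix in Python 2, any str in Python 3),
# \201 is the code point U+0081, so encoding to ISO-8859-1 is lossless.
print("ã\201".encode("iso-8859-1") == b"\xe3\x81")  # True
```

This is consistent with the suggestion above to use the `u` prefix: the failure comes from mixing UTF-8 encoded characters with raw escape bytes in one byte string.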

1 Answer

You are doing this the wrong way round. You have a bytestring that is wrongly encoded as utf-8 and you want it to be interpreted as iso-8859-1:

>>> ch = u"ä¸\xad".encode('utf-8')  # the garbled text as UTF-8 bytes; \xad is a soft hyphen
>>> print ch.decode('utf-8').encode('iso-8859-1')
中
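For readers on Python 3, where every `str` is already Unicode, the same round trip runs in the direction of the Java code: encode the garbled text as ISO-8859-1 to recover the original bytes, then decode those bytes as UTF-8. The following is a minimal sketch; `fix_mojibake` is a hypothetical helper name, not something from the question or a library.

```python
# Python 3 sketch of Java's new String(s.getBytes("ISO8859_1"), "UTF-8").
def fix_mojibake(text):
    """Re-read text that was wrongly decoded as ISO-8859-1 as UTF-8."""
    raw = text.encode("iso-8859-1")  # recover the original bytes
    return raw.decode("utf-8")       # decode them with the intended codec


# 'ä', '¸', and a soft hyphen (U+00AD) correspond to the bytes E4 B8 AD.
garbled = "\u00e4\u00b8\u00ad"
fixed = fix_mojibake(garbled)
print(fixed == "\u4e2d")  # True: U+4E2D is the character 中
```

The comparison is printed instead of the character itself so the sketch does not depend on the terminal's encoding.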
Daniel Roseman
  • That's a good answer and I did get "中" with this sample case. But when it comes to my real case, where ch = 'masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«ä¼\232è-°å®¤', I get UnicodeDecodeError: 'utf8' codec can't decode byte 0x81 in position 19: invalid start byte. When I do the same thing in Java, it prints masanori harigae のパーソナル会�-�室. – fanwu72 Nov 30 '16 at 14:47
  • @fanwu72 You should put the real case in the question. – furas Nov 30 '16 at 15:12
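The real case from the comment above can be sketched in Python 3 as follows. Java's `String(byte[], String)` constructor replaces malformed input with the replacement character U+FFFD instead of raising an error; passing `errors="replace"` to `bytes.decode` gives comparable behavior in Python. This assumes the string is exactly as posted, including the plain ASCII hyphen between `è` and `°`.

```python
# Python 3 sketch: tolerate malformed UTF-8 the way Java's constructor does.
# In a Python 3 str literal, \201 etc. are octal escapes for code points
# below 256, so the whole string can be encoded as ISO-8859-1.
garbled = ("masanori harigae ã\201®ã\203\221ã\203¼ã\202½ã\203\212ã\203«"
           "ä¼\232è-°å®¤")

raw = garbled.encode("iso-8859-1")             # recover the original bytes
fixed = raw.decode("utf-8", errors="replace")  # bad bytes become U+FFFD

# Same text as the Java output quoted in the question.
expected = "masanori harigae のパーソナル会\ufffd-\ufffd室"
print(fixed == expected)  # True
```

The two replacement characters come from the bytes E8 2D B0: the ASCII hyphen (0x2D) is not a valid continuation byte. If the original byte there was 0xAD (a soft hyphen that was converted to `-` somewhere along the way), the sequence E8 AD B0 would decode cleanly as 議, but that is a guess about the data, not something the post confirms.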