
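I'm learning Scrapy with Python 2.7. I am trying to read a file that is UTF-16-LE encoded. Each line of the file is read as a unicode string, but instead of Chinese characters it contains the individual UTF-8 bytes as separate characters, as in str1 below.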

str1 = u'Asus,\xe9\xa3\x9e\xe9\xa9\xac'
print type(str1), str1
# print 'decoding', str1.decode('utf-8')        # it throws UnicodeEncodeError

str2 = 'Asus,\xe9\xa3\x9e\xe9\xa9\xac'
print type(str2), str2
print 'decoding', str2.decode('utf-8')

The output of console is:

<type 'unicode'> Asus,é£é©¬
<type 'str'> Asus,飞马
decoding Asus,飞马

How can I convert str1 into a unicode string that displays as 'Asus,飞马'? Any answer will be appreciated.

彭泽鑫

1 Answer


I wonder how you got str1; it may be the result of an improper manipulation. The following works for me:

>>> str1 = u'Asus,\xe9\xa3\x9e\xe9\xa9\xac'
>>> str1.encode('iso8859-1')
'Asus,\xe9\xa3\x9e\xe9\xa9\xac'
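
The encode step above returns a byte string, so one more step is needed to get a unicode string back: decoding those bytes as UTF-8. A minimal sketch of the full round trip, assuming str1 really is UTF-8 data that was decoded as Latin-1 somewhere along the way (the names `raw_bytes` and `fixed` are only illustrative):

```python
from __future__ import print_function

# Mojibake: UTF-8 bytes that were wrongly decoded one byte per character.
str1 = u'Asus,\xe9\xa3\x9e\xe9\xa9\xac'

# Step 1: map each code point U+0000..U+00FF back to the byte it came from.
raw_bytes = str1.encode('iso8859-1')

# Step 2: decode those bytes with the encoding they were actually written in.
fixed = raw_bytes.decode('utf-8')

# Compare against the expected text, spelled as code points to keep this
# source file ASCII-only (U+98DE and U+9A6C are the two Chinese characters).
print(fixed == u'Asus,\u98de\u9a6c')  # True
```

This also explains the UnicodeEncodeError in the question: in Python 2, calling `.decode('utf-8')` on a unicode object first encodes it implicitly with the ASCII codec, which fails on `\xe9`.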
Gribouillis
  • I download a file and then read from it. Your approach works when I type str1 in directly, but it doesn't work when reading from the file: the output file still contains ASCII characters. My code looks like this: `with open(filepath, 'r') as fp: line = fp.read() if line.startswith('\xff\xfe'): encoding = 'utf-16-le' fp2 = codecs.open(filepath, 'r', encoding) line_list = fp2.readlines() fp2.stream.close() fp3 = codecs.open(os.getcwd() + '/full/destfile.csv', 'w') for i in line_list: j = i.encode('iso8859-1') fp3.write(j.decode('utf-8')) fp3.close()` – 彭泽鑫 Dec 09 '16 at 15:58
  • You may be reading the file with the wrong encoding, based on an optional BOM. You could try the [chardet](https://pypi.python.org/pypi?%3Aaction=search&term=chardet&submit=search) module to detect the encoding automatically. – Gribouillis Dec 09 '16 at 16:22
  • Your comment is right. I searched for it and found this answer [link](http://stackoverflow.com/questions/9177820/python-utf-16-csv-reader), then figured out that str1 is actually 'Asus,\\xe9\\xa3\\x9e\\xe9\\xa9\\xac'. The double backslashes in str1 are the reason for the mojibake. After converting them and decoding the result, it worked. – 彭泽鑫 Dec 10 '16 at 06:02
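
The last comment does not show the conversion that was used, so the following is only a sketch of one way to do it, assuming the file holds literal `\xNN` escape sequences that spell out UTF-8 bytes. The helper `unescape_utf8` and the sample file are hypothetical and exist only for this illustration. Opening the file with the `utf-16` codec (rather than `utf-16-le`) lets Python consume the BOM and choose the byte order itself.

```python
from __future__ import print_function
import io
import os
import tempfile


def unescape_utf8(text):
    """Turn literal backslash-x escapes that spell UTF-8 bytes into text."""
    # Interpret the escapes: each backslash-x pair becomes code point U+00NN.
    # Assumes the text holds only ASCII characters and such escapes.
    mojibake = text.encode('ascii').decode('unicode_escape')
    # Map the code points back to bytes, then decode them as UTF-8.
    return mojibake.encode('latin-1').decode('utf-8')


# Hypothetical sample file standing in for the downloaded CSV.
path = os.path.join(tempfile.mkdtemp(), 'sample.csv')
with io.open(path, 'w', encoding='utf-16') as fp:  # this codec writes a BOM
    fp.write(u'Asus,\\xe9\\xa3\\x9e\\xe9\\xa9\\xac\n')

# Reading with 'utf-16' strips the BOM, so it never reaches the first line.
with io.open(path, 'r', encoding='utf-16') as fp:
    lines = [unescape_utf8(line.rstrip(u'\n')) for line in fp]

# Expected text spelled as code points to keep this source ASCII-only.
print(lines == [u'Asus,\u98de\u9a6c'])  # True
```

In Python 2 specifically, a byte string with such escapes can also be handled with `s.decode('string_escape').decode('utf-8')`; the version above avoids that codec because it does not exist in Python 3.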