I got some strings from the database which look like '\xe7\x8e\xa9'.

I think it's utf-8. I can print them out by using:

print '\xe7\x8e\xa9'
玩

The thing is, I need to write them into another file as Chinese characters (e.g. 玩), together with other alphanumeric data.

I tried encode and decode, but I didn't get the results I was hoping for.

Here are my attempts:

f = open('a','w')
name = u.name #.encode('utf8')  # I commented it to get raw
f.write('\t$$%r$$many_other_data' % name) 
f.close()

When I open the output file with vim 7.4, I see:

 `$$u'\u7aef\u5e84\u7684\u9a6c\u6b47\u5c14'$$many_other_data`
Burhan Khalid
Zen
  • Are they exactly like that in the raw form when opened from a text editor, or after reading they look like that using the `__repr__` method? Also, is this Python 2 or 3? – metatoaster Aug 12 '14 at 09:56

3 Answers

Files contain bytes, not characters. To store characters in a file, you have to encode them as bytes.

A particularly common encoding is ASCII. It's an encoding just like UTF-8, UTF-16, and the other Unicode encodings.

The bytes are meaningless (as text) on their own, without an associated encoding to give them meaning.

You'll need to view the file with an editor or viewer that is using the same encoding that you used to write the file.
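
As a rough illustration of the point above, the sketch below encodes one character with two different encodings and then decodes the UTF-8 bytes with the right and the wrong encoding. The variable names are made up for this example, and GBK is only picked as a second encoding that can represent Chinese text. The code is written to run on both Python 2.7 and Python 3.

```python
text = u'\u73a9'  # one Chinese character, held as a unicode string

# The same character becomes different bytes under different encodings.
utf8_bytes = text.encode('utf-8')  # three bytes: \xe7 \x8e \xa9
gbk_bytes = text.encode('gbk')     # a different, two-byte sequence

# The same bytes become different text under different encodings.
as_utf8 = utf8_bytes.decode('utf-8')      # the original character again
as_latin1 = utf8_bytes.decode('latin-1')  # three unrelated characters
```

This is why a viewer that assumes the wrong encoding shows garbage even though the file itself is intact.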

John La Rooy
  • Can u'\u5916\u5411\u7684\u95ea\u7535' be converted into Chinese and written to another file as Chinese? It's unicode. – Zen Aug 12 '14 at 10:04
  • I declared # encoding:utf-8 at the beginning of my .py file, but I still got '\xe7\x8e\xa9' in the output file. I used the write method. It seems I convert these \xe7 sequences into normal strings? I'm totally confused. – Zen Aug 12 '14 at 10:08

Since you have bytes, you need to know their encoding. There are many ways to turn bytes into unicode (with str.decode), and which one is correct depends on the encoding the bytes are in.

You can't get this from the bytes themselves; someone has to tell you the encoding.

Although, sometimes you can make an educated guess:

>>> import chardet
>>> s = '\xe7\x8e\xa9'
>>> chardet.detect(s)
{'confidence': 0.505, 'encoding': 'utf-8'}
>>> s.decode(chardet.detect(s)['encoding'])
u'\u73a9'
>>> print _
玩

Now, you should convert any strings from the db to unicode as soon as they enter your Python program, so that your code works entirely in unicode, not bytes.

Then, you can write your file like this:

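import io
# Text mode ('w'), not 'wb': binary mode does not accept an encoding argument.
with io.open('/tmp/myfile.txt', 'w', encoding='utf-8') as f:
    # A file opened in text mode with io.open expects unicode strings.
    f.write(u'\u73a9')
    f.write(u'\n')
    f.write(u'random other data 12345...')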
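
Putting the two steps together (decode on the way in, let the file object encode on the way out), a minimal sketch could look like the following. The helper `to_text`, the sample bytes, and the temporary file name are hypothetical and only stand in for the real database value and output path; the line format mirrors the one in the question. It should run on both Python 2.7 and Python 3.

```python
import io
import os
import tempfile

def to_text(value, encoding='utf-8'):
    """Return value as a unicode string, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

raw_name = b'\xe7\x8e\xa9'  # stand-in for bytes coming from the database
name = to_text(raw_name)    # unicode from here on

# Hypothetical output path in the system temp directory.
path = os.path.join(tempfile.gettempdir(), 'unicode_demo.txt')

# The text-mode file object encodes to UTF-8 on write.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\t$$%s$$many_other_data\n' % name)

# Read the raw bytes back to see what actually landed on disk.
with io.open(path, 'rb') as f:
    written = f.read()
```

Opened in an editor that is set to UTF-8, the file should show the character itself between the `$$` markers.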
wim

Here is a code sample that works for me:

with open('foo', 'w+') as f:
    f.write('\xe7\x8e\xa9')

and in the foo file I have:

玩

but I opened foo with UTF-8 encoding, so it displays the Chinese character instead of the escaped value.

I've tested it with both vim and gedit and it works just fine.

Perhaps you should tell us the type of your output file, so we can be more specific.

EDIT

I see the problem now. You used the %r conversion when writing your string. You should use %s (and enable the encoding again).

Here is a working example:

>>> a = u'\u7aef\u5e84\u7684\u9a6c\u6b47\u5c14'
>>> f = open('tmp', 'w')
>>> a = a.encode('utf-8')
>>> f.write('\t$$%r$$other_data\n'%a)
>>> f.write('\t$$%s$$other_data\n'%a)
>>> f.close()

results being:

    $$'\xe7\xab\xaf\xe5\xba\x84\xe7\x9a\x84\xe9\xa9\xac\xe6\xad\x87\xe5\xb0\x94'$$other_data
    $$端庄的马歇尔$$other_data

Please read this answer for reference on the difference between %r and %s.
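
In short, %r inserts `repr()` of the value, which is meant for debugging, while %s inserts the text itself. A small sketch of the difference, using a made-up two-character name, is below. Note that the exact `repr()` text depends on the Python version: Python 2 shows `u'...'` with `\u` escapes, while Python 3 shows the characters inside quotes.

```python
name = u'\u7aef\u5e84'

# %r embeds repr(name): always wrapped in quotes, and on Python 2 also
# prefixed with u and written as \u escapes.
with_r = u'$$%r$$' % name

# %s embeds the characters themselves, with no quotes and no escapes.
with_s = u'$$%s$$' % name
```

Only the `with_s` form is suitable for data that someone is supposed to read as Chinese text.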

Hope that helped.

Pawel Wisniewski