Python - file.write() causes Chinese text

Question

When I write a certain string to a file in an infinite loop, for example:

file = open('txt.txt', 'w')
while 1:
    file.write('colour')

It gives me all this Chinese text: [picture]

Why does this happen?

This works for me and may just be an encoding issue with your text editor. Have you tried viewing it in another program? Also, it would be interesting to see what a pure ASCII dump looks like. – Monkey Supersonic, Sep 05 '16 at 20:16

Trevor Merrifield · Accepted Answer · 2016-09-05T20:50:41.477

You can get the same result by copy-pasting "colour" several times in Notepad, then saving and reloading the file. There's nothing wrong with your Python code. The bytes written to the file will look something like this (in hex):

63 6F 6C 6F 75 72  63 6F 6C 6F 75 72 ...
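
If you want to check the bytes yourself, here is a minimal sketch, assuming Python 3. The name hex_dump is a hypothetical helper, not something from the question, and it uses a bounded repeat count rather than the infinite loop.

```python
def hex_dump(text, repeat=2):
    """Return the ASCII bytes of `text` repeated `repeat` times, as spaced hex."""
    data = (text * repeat).encode('ascii')
    # Iterating over a bytes object in Python 3 yields integers.
    return ' '.join(format(b, '02X') for b in data)

print(hex_dump('colour'))  # 63 6F 6C 6F 75 72 63 6F 6C 6F 75 72
```

Each letter is one byte: c is 63, o is 6F, l is 6C, u is 75 and r is 72.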

When Notepad reads these bytes it needs to guess what they represent. Ideally it would decode the text as UTF-8 or ASCII. Instead it sees a pattern in the bytes and guesses wrong.

I noticed that every pair of bytes corresponds to one Chinese character. This suggests Notepad might be decoding the file as UTF-16. The following test in Python confirms that this is the case:

>>> original = 'colour' * 100
>>> original.encode('utf-8').decode('utf-16')
\u6f63\u6f6c\u7275\... # repeating

These code points correspond to 潣, 潬, and 牵, which are the same characters that Notepad displays. So the issue is that Notepad is incorrectly decoding your bytes as UTF-16 instead of UTF-8. This is reminiscent of the old "Bush hid the facts" bug.
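
To make the pairing explicit, here is a rough sketch, assuming Python 3 and little-endian UTF-16 (low byte first). The name utf16le_units is a hypothetical helper used only for illustration.

```python
def utf16le_units(data):
    """Pair up bytes as little-endian 16-bit code units (low byte first)."""
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd byte
    return [data[i] | (data[i + 1] << 8) for i in range(0, usable, 2)]

raw = 'colour'.encode('ascii')          # bytes 63 6F 6C 6F 75 72
units = utf16le_units(raw)

print([hex(u) for u in units])          # ['0x6f63', '0x6f6c', '0x7275']
print(''.join(chr(u) for u in units))   # 潣潬牵
```

So the two bytes for "co" are read as the single code unit 0x6F63, "lo" as 0x6F6C, and "ur" as 0x7275. None of these fall in the surrogate range, so each one maps directly to a character.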

score 0 · Answer 2 · edited May 23 '17 at 12:22

I believe your encoding is set to an improper default (possibly on install or based on your computer's settings).

You can change it with:

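import sys
# Python 2 only: sys.setdefaultencoding() was removed in Python 3, where reload() is no longer a builtin.
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')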

Check out this thread for more info: "Changing default encoding of Python?"
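
For what it's worth, here is a minimal sketch, assuming Python 3, of why the encoding chosen on the Python side is unlikely to matter for this particular string. Pure-ASCII text produces identical bytes under ASCII and UTF-8. The utf-8-sig codec differs only by prepending a byte order mark. Whether a given editor uses that mark as a hint is an assumption here, not something this snippet verifies.

```python
text = 'colour'

as_ascii = text.encode('ascii')
as_utf8 = text.encode('utf-8')
with_bom = text.encode('utf-8-sig')   # UTF-8 preceded by the BOM bytes EF BB BF

print(as_ascii == as_utf8)            # True
print(with_bom[:3])                   # b'\xef\xbb\xbf'
print(with_bom[3:] == as_utf8)        # True
```

In Python 3 the same codec names can be passed as the encoding argument of open(), for example open('txt.txt', 'w', encoding='utf-8-sig').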

answered Sep 05 '16 at 20:21 · Ch1pCh4p

Nope, this problem isn't on Python's end, it's Notepad. – Trevor Merrifield Sep 05 '16 at 20:23
Hmm... It's possible. I just assumed we needed to hear what OP had to say about Monkey Supersonic's suggestion (which follows your thought). – Ch1pCh4p Sep 05 '16 at 20:31
Yup, Monkey Supersonic was right. It's not hard to reproduce the problem using only Notepad, or even a hex editor. The UTF-8 bytes are misinterpreted as UTF-16 by Notepad. – Trevor Merrifield Sep 05 '16 at 20:56
