How to print non-ascii characters to file in Python 2.7

Question

I'm trying to obfuscate some JavaScript by altering its character codes, but I've found that I can't correctly print characters outside of a certain range in Python 2.7.

For example, here's what I'm trying to do:

f = open('text.txt','w')
f.write(unichr(510).encode('utf-8'))
f.close()

I can't write unichr(510) directly because Python complains that the ascii codec can't encode it (ordinal not in range), so I encode it with utf-8. This turns the single character u'\u01fe' into two bytes, '\xc7\xbe'.

Now, in javascript, it's easy to get the symbol for the character code 510:

String.fromCharCode(510)

Gives the single character: Ǿ

What I'm getting with Python is two characters: Ç¾

If I pass those characters to javascript, I can't retrieve the original single character.

I know that it is possible to print the Ǿ character in Python, but I haven't been able to figure it out. I've gotten as far as using unichr() instead of chr() and encoding the result to 'utf-8', but I'm still coming up short. I've also read that Python 3 has this functionality built into the chr() function, but that won't help me.

Does anyone know how I can accomplish this task?

Thank you.

How are you passing the `'\xc7\xbe'` to JavaScript? Those two consecutive bytes (not to be confused with the characters Ç¾) are the UTF-8 encoding of Ǿ, which JavaScript should recognize as such (or at least treat no differently than a Ǿ appearing in a UTF-8 encoded JS file). — jwodder, Apr 08 '13 at 01:22
I'm saving the `'\xc7\xbe'` to a javascript file. Also, it is treating it as two separate characters. @jwodder — bozdoz, Apr 08 '13 at 01:25

Sheng · Accepted Answer · 2013-04-08T01:32:32.863

4

You should open the file in binary mode:

f = open('text.txt','wb')

And then write the bytes (in Python 3):

f.write(chr(510).encode('utf-8'))

Or in Python 2:

f.write(unichr(510).encode('utf-8'))

Finally, close the file:

f.close()
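
As a quick check that the two bytes written this way really are one character, here is a minimal sketch. It assumes a scratch file in a temporary directory rather than the `text.txt` above, and uses a `u'...'` literal so the same code runs on Python 2.7 and 3.3+.

```python
import os
import tempfile

# Hypothetical scratch file; the answer above writes to 'text.txt'.
path = os.path.join(tempfile.mkdtemp(), 'text.txt')

char = u'\u01fe'  # same code point as unichr(510) / chr(510)

# Binary mode: the file receives exactly the bytes we hand it.
with open(path, 'wb') as f:
    f.write(char.encode('utf-8'))

with open(path, 'rb') as f:
    raw = f.read()

decoded = raw.decode('utf-8')

print(len(raw))      # number of bytes on disk
print(len(decoded))  # number of characters after decoding
print(ord(decoded))  # code point of that character
```

The byte count and the character count differ because UTF-8 needs more than one byte for any code point above 127.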

Or, more cleanly, open the file in text mode with an explicit encoding and let Python do the encoding for you (Python 3):

>>> f = open('e:\\text.txt','wt',encoding="utf-8")
>>> f.write(chr(510))
>>> f.close()

After that, you could read the file as:

>>> f = open('e:\\text.txt','rb')
>>> content = f.read().decode('utf-8')
>>> content
'Ǿ'

Or

>>> f = open('e:\\text.txt','rt',encoding='utf-8')
>>> f.read()
'Ǿ'

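Tested on Windows 7 with Python 3. The binary-mode version should also work with Python 2.x. Note that the `encoding` argument to the built-in open() exists only in Python 3; on Python 2, use io.open() or codecs.open() instead.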
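
For a text-mode version that does not depend on the Python version, one option is `io.open`, which accepts `encoding` on Python 2.6+ and is the same function as the built-in `open` on Python 3. This is only a sketch, and the temporary path is an assumption made for illustration.

```python
import io
import os
import tempfile

# Hypothetical scratch file, used instead of 'e:\\text.txt'.
path = os.path.join(tempfile.mkdtemp(), 'text.txt')

# Text mode with an explicit encoding: text goes in, UTF-8 bytes land on disk.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\u01fe')

# Read it back as text...
with io.open(path, 'r', encoding='utf-8') as f:
    content = f.read()

# ...and as raw bytes, to see what is actually stored.
with io.open(path, 'rb') as f:
    raw_bytes = f.read()

print(repr(content))
print(repr(raw_bytes))
```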

edited Apr 08 '13 at 01:32

answered Apr 08 '13 at 01:25

Sheng


Doesn't seem to change. Still getting those two characters. – bozdoz Apr 08 '13 at 01:27
You should tell your text editor to open it with UTF-8 encoding. It works perfectly on my Win7+Python3.3+Notepad (or UltraEdit). – Sheng Apr 08 '13 at 01:37
Looks like that might be the solution to the problem. Hopefully it will port over to JavaScript as easily. Thanks! – bozdoz Apr 08 '13 at 01:42
My pleasure. I just tested it on my Win7+Python2.7 and it also works perfectly. You can open the file with Notepad to check the result. The problem may be with Notepad++. – Sheng Apr 08 '13 at 01:48

score 4 · Answer 2 · answered Apr 08 '13 at 01:49

How about this?

import codecs
outfile = codecs.open(r"C:\temp\unichr.txt", mode='w', encoding="utf-8")
outfile.write(unichr(510))
outfile.close()
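
The same `codecs.open` call can carry a whole obfuscated string. The sketch below assumes a shift of 150 per character, taken from the decoding loop the asker describes in the comments. The `shift_text` helper, the sample snippet and the temporary path are hypothetical names introduced here for illustration.

```python
import codecs
import os
import tempfile

try:
    unichr            # Python 2
except NameError:
    unichr = chr      # Python 3: chr() already returns a text character

SHIFT = 150  # assumed offset, mirroring the asker's charCodeAt(i) - 150


def shift_text(text, offset):
    """Return text with every code point moved by offset (hypothetical helper)."""
    return u''.join(unichr(ord(ch) + offset) for ch in text)


source = u'alert(1);'  # hypothetical JavaScript snippet
obfuscated = shift_text(source, SHIFT)

path = os.path.join(tempfile.mkdtemp(), 'unichr.txt')  # hypothetical path

# The writer takes unicode text and encodes it to UTF-8 on the way out.
outfile = codecs.open(path, mode='w', encoding='utf-8')
outfile.write(obfuscated)
outfile.close()

# Decoding with the same codec restores one character per original character.
infile = codecs.open(path, mode='r', encoding='utf-8')
restored = shift_text(infile.read(), -SHIFT)
infile.close()

print(restored)
```

The round trip only works if the reader decodes the file as UTF-8. A reader that treats each byte as a character would see more characters than were written.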

answered Apr 08 '13 at 01:49

bbayles


This worked perfectly for me using python 2.7, thank you. – elPastor Jun 27 '17 at 21:02

score 1 · Answer 3 · answered Apr 08 '13 at 01:25

Python is writing the bytes '\xc7\xbe' to the file:

In [45]: unichr(510).encode('utf-8')
Out[45]: '\xc7\xbe'

JavaScript is apparently interpreting those bytes as the two-character unicode string u'\xc7\xbe' instead:

In [46]: 'Ç¾'.decode('utf-8')
Out[46]: u'\xc7\xbe'

In [47]: 'Ç¾'.decode('utf-8').encode('latin-1')
Out[47]: '\xc7\xbe'

The problem is in how JavaScript is converting the bytes to unicode, not in how Python is writing the bytes.
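
A minimal sketch of that diagnosis, runnable on Python 2.7 and 3.3+: the same two bytes are decoded once as UTF-8 and once as latin-1. Here latin-1 stands in for whatever one-byte-per-character encoding the reader assumed, which is an assumption since the thread does not say which one was used.

```python
raw = u'\u01fe'.encode('utf-8')    # the two bytes Python writes

as_utf8 = raw.decode('utf-8')      # what a UTF-8 aware reader sees
as_latin1 = raw.decode('latin-1')  # what a one-byte-per-character reader sees

print(len(as_utf8))                          # characters seen by the first reader
print(len(as_latin1))                        # characters seen by the second reader
print([hex(ord(ch)) for ch in as_latin1])    # code points of the misread text

# The misreading is reversible, because latin-1 maps every byte
# to exactly one code point and back.
repaired = as_latin1.encode('latin-1').decode('utf-8')
```

The bytes on disk are identical in both cases. Only the codec chosen by the reader changes how many characters appear.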

answered Apr 08 '13 at 01:25

unutbu


The file is javascript. I'm decoding the js with a for loop, and adjusting each character with something like this: String.fromCharCode( l.charCodeAt(i) - 150 ); – bozdoz Apr 08 '13 at 01:30
Also, I can see by viewing the file that Python is writing two characters when it should be writing one. – bozdoz Apr 08 '13 at 01:30
The for loop is intended to iterate over each character, so it is iterating over each byte, which is not what I want. – bozdoz Apr 08 '13 at 01:32
What single byte do you want written to the file? The choice has to range from `'\x00'` to `'\xff'` (256 choices). – unutbu Apr 08 '13 at 01:33
I don't know @unutbu. Believe it or not, I accidentally/somehow printed the Ǿ with python, but have no idea how I did it, and I'm trying to repeat my steps to no avail. – bozdoz Apr 08 '13 at 01:35
When you see `Ǿ` in a file, your editor is actually playing a trick on you. The file contains nothing but bytes. The editor chooses to decode those bytes using a codec, such as utf-8 or cp1252 or ascii, for example. The decoding associates byte sequences into characters with glyphs such as `Ǿ`. But the underlying file is still nothing but bytes. I don't know JavaScript, but if you can tell us what **bytes** are supposed to be in the JavaScript file, we can of course tell you how to write those bytes to a file in Python. – unutbu Apr 08 '13 at 01:42
I think you're right. The editor was playing a trick on me. I'm able to toggle back and forth between those two renderings by changing Notepad++'s encoding setting from `UTF-8 without BOM` to `UTF-8`. Thanks for your help! – bozdoz Apr 08 '13 at 01:43
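
The Notepad++ toggle suggests the editor was guessing the encoding of a file that carried no byte order mark (BOM). One possible way to remove the guesswork is to write the file with Python's `utf-8-sig` codec, which prepends a BOM. This assumes that whatever consumes the file tolerates a BOM. The file name below is hypothetical.

```python
import codecs
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'text.js')  # hypothetical file name

# 'utf-8-sig' writes the three-byte BOM first, then ordinary UTF-8.
with io.open(path, 'w', encoding='utf-8-sig') as f:
    f.write(u'\u01fe')

with io.open(path, 'rb') as f:
    raw = f.read()

has_bom = raw.startswith(codecs.BOM_UTF8)
payload = raw[len(codecs.BOM_UTF8):]  # the bytes after the BOM

# Reading with 'utf-8-sig' strips the BOM again; reading with plain 'utf-8'
# would keep it as a leading U+FEFF character.
with io.open(path, 'r', encoding='utf-8-sig') as f:
    text = f.read()

print(has_bom)
print(repr(payload))
print(repr(text))
```

The character itself is still stored as the same two bytes. The BOM only gives editors a hint about how to decode them.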
