7

I'm trying to write the symbol to a text file in python. I think it has something to do with the encoding (utf-8). Here is the code:

# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write("●")
outFile.close()

Instead of the black "●" I get "â—". How can I fix this?

Bhargav Rao
Jesper Lundin
  • Thanks for your answers! I found out that the problem was that WordPad wouldn't show the dot, but Notepad did. So it actually worked from the beginning. – Jesper Lundin Apr 19 '15 at 14:07
  • Python 2 or 3? (Hint: Py3 is better) – jeromej Apr 19 '15 at 14:07
  • Still, there are problems with the code above: it basically only works if (1) your program editor indeed uses UTF-8 (this might not have been the case) and (2) with text file viewers that use the same encoding as *your* programming editor. You can have a look at my solution, for something that should give "●" on almost any machine, for almost any user, whatever their encoding of choice is. – Eric O. Lebigot Apr 19 '15 at 14:18
  • @JeromeJ: It's Python 2 – Jesper Lundin Apr 19 '15 at 14:29
  • @EOL Nice! Good to know! – Jesper Lundin Apr 19 '15 at 14:31

6 Answers

3

Open the file with the io package and set the encoding to utf8; this works with both Python 2 and Python 3. When writing, pass a unicode string.

import io
outFile = io.open('./myFile.txt', 'w', encoding='utf8')
outFile.write(u'●')
outFile.close()

Tested on Python 2.7.8 and Python 3.4.2
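
One way to check what this approach puts on disk is to read the file back in binary mode. The following is only a minimal sketch: the temporary directory and the variable names are assumptions made for illustration, not part of the answer above.

```python
import io
import os
import tempfile

# Hypothetical temporary location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')

# Text mode with an explicit encoding: write() takes a unicode string
# and the file object encodes it to UTF-8 bytes.
with io.open(path, 'w', encoding='utf8') as out_file:
    out_file.write(u'\u25cf')  # U+25CF BLACK CIRCLE

# Binary mode: read the raw bytes back, without any decoding.
with io.open(path, 'rb') as in_file:
    raw = in_file.read()

print(repr(raw))
```

The `with` blocks close the files even if an exception is raised, so no explicit close() call is needed.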

Alok Mysore
  • This is only for Python 3, and it only works if the desired output is UTF-8 (it does not have to be the same encoding as the one used by the program editor, and it generally varies from machine to machine, especially in the Windows world, and in some countries). – Eric O. Lebigot Apr 19 '15 at 14:09
  • @EOL, That's right. I've updated my answer to make up the deficiency of my old answer. Thanks :) – Alok Mysore Apr 19 '15 at 14:22
  • If you have to use that on a UTF-8 capable system, you do not add much to the original code, which writes exactly the same file! – Serge Ballesta Apr 19 '15 at 15:06
1

If you are using Python 2, use codecs.open instead of open and unicode instead of str:

# -*- coding: utf-8 -*-
import codecs
outFile = codecs.open('./myFile.txt', 'wb', 'utf-8')
outFile.write(u"●")
outFile.close()

In Python 3, pass the encoding keyword argument to open:

# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'w', encoding='utf-8')
outFile.write("●")
outFile.close()
Elektito
0
>>> ec = u'\u25cf' # unicode("●", "UTF-8")
>>> open("/tmp/file.txt", "wb").write(ec.encode('UTF-8'))
felipsmartins
0

What your program does is to produce an output file in the same encoding as your program editor (the coding at the top does not matter, unless your program editor uses it for saving the file). Thus, if you open myFile.txt with a program that uses the same encoding as your program editor, everything looks fine.

This does not mean that your program works for everybody.

For this, you must do two things. You must first indicate the encoding used for text files on your machine. This is a little hard to detect, but the following should often work:

# coding=utf-8  # Put your editor's encoding here

import codecs
import locale
import sys

# Selection of the first non-None, reasonable encoding:
out_encoding = (locale.getlocale()[1]
                or locale.getpreferredencoding()
                or sys.stdin.encoding or sys.stdout.encoding
                # Default:
                or "UTF8")

outFile = codecs.open('./myFile.txt', 'w', out_encoding)

Note that it is very important to specify the right coding on top of the file: this must be your program editor's encoding.

If you know the encoding you want for your output file, you can pass it directly to codecs.open(). Otherwise, the more general and portable out_encoding expression above should work for most users on most computers (i.e., whatever their encoding of choice is, they should be able to read "●" in the resulting file, assuming their computer's encoding can represent it).

Then you must write a Unicode string, not bytes:

outFile.write(u"●")

(note the leading u, meaning "unicode string").
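
The caveat about the machine's encoding being able to represent the symbol can be tested before writing. The sketch below is a variant of the selection above, not the same logic: the helper name pick_encoding, the temporary path, and the choice to fall back to UTF-8 when a candidate cannot encode the text are all assumptions made for illustration.

```python
import codecs
import locale
import os
import sys
import tempfile


def pick_encoding(text):
    """Return the first candidate encoding that can encode `text`.

    Hypothetical helper: falls back to UTF-8 when no candidate fits.
    """
    candidates = [
        locale.getlocale()[1],
        locale.getpreferredencoding(),
        getattr(sys.stdin, 'encoding', None),   # stdin may be replaced or None
        getattr(sys.stdout, 'encoding', None),
    ]
    for name in candidates:
        if not name:
            continue
        try:
            text.encode(name)  # raises if the symbol is not representable
            return name
        except (UnicodeEncodeError, LookupError):
            continue
    return 'utf-8'


symbol = u'\u25cf'
out_encoding = pick_encoding(symbol)

# Hypothetical temporary location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'myFile.txt')
with codecs.open(path, 'w', out_encoding) as out_file:
    out_file.write(symbol)
```

Falling back to UTF-8 is a design choice: it means the file is always written, but a reader whose tools assume the locale encoding may then see garbled characters instead of an error.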

For a deeper understanding of the issues at hand, one of my previous answers should be very helpful: UnicodeDecodeError when redirecting to file.

Community
Eric O. Lebigot
0

This should do the trick

# -*- coding: utf-8 -*-
outFile = open('./myFile.txt', 'wb')
outFile.write(u"\u25CF".encode('utf-8'))
outFile.close()

have a look at this

Heisenberg
0

I'm very sorry, but writing a symbol to a text file without saying what the encoding of the file should be is simply nonsense.

It may not be evident at first sight, but text files are indeed encoded and may be encoded in different ways. If you have only letters (upper and lower case, but not accented ones), digits and simple symbols (everything that has an ASCII code below 128), all should be fine, because 7-bit ASCII is now a standard and those characters have the same representation in the major encodings.

But as soon as you get true symbols or accented characters, their representation varies from one encoding to another. For example, the symbol ● has the UTF-8 representation (in Python notation) \xe2\x97\x8f. What is worse, it cannot be represented in the Latin-1 (ISO-8859-1) encoding.

Another example is the French e with an acute accent, é: it is represented in UTF-8 as \xc3\xa9 (note: 2 bytes), but in Latin-1 as \xe9 (a single byte).
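
These byte sequences can be checked directly in Python. This is a small sketch with variable names chosen for illustration; it encodes both characters and tests whether the bullet fits in Latin-1.

```python
bullet = u'\u25cf'    # the bullet symbol, U+25CF
e_acute = u'\u00e9'   # e with acute accent, U+00E9

# Encode each character and keep the resulting byte strings.
utf8_bullet = bullet.encode('utf-8')
utf8_e = e_acute.encode('utf-8')
latin1_e = e_acute.encode('latin-1')

# Latin-1 only covers U+0000 to U+00FF, so encoding the bullet must fail.
try:
    bullet.encode('latin-1')
    bullet_fits_latin1 = True
except UnicodeEncodeError:
    bullet_fits_latin1 = False

print(repr(utf8_bullet), repr(utf8_e), repr(latin1_e), bullet_fits_latin1)
```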

So I tested your code in my Ubuntu box using a UTF8 encoding and the command cat myFile.txt ... correctly showed the bullet !

sba@sba-ubuntu:~/stackoverflow$ cat myFile.txt 
●sba@sba-ubuntu:~/stackoverflow$ 

(as you didn't add any newline after the bullet, the prompt immediately follows it)

In conclusion :

Your code correctly writes the bullet to the file in UTF-8 encoding. If your system natively uses another encoding (ISO-8859-1 or its variant Windows-1252), you cannot convert the file to it, because this character simply does not exist in these encodings.

But you can always see it in a text editor that supports different encodings, like the excellent vim, which exists on all major systems.


Proof of above :

On a Windows 7 computer, I opened a vim window and instructed it to accept UTF-8 with :set encoding=utf-8. I then pasted the original code from the OP and saved it to a file foo.py.

I opened a cmd.exe window and executed python foo.py (using Python 2.7): it created a file myFile.txt containing the 3 bytes (hex) e2 97 8f, which is the UTF-8 representation of the bullet (I could confirm it with vim's Tools/Hexa convert).

I could even open myFile.txt in idle and actually saw the bullet. Even notepad.exe could show the bullet !

So even on a Windows 7 computer that does not natively accept UTF-8, the code from the OP correctly generates a text file that, when opened with a text editor accepting UTF-8, contains the bullet.

Of course, if I try to open myFile.txt with vim in latin1 mode, I get: â—. In a cmd window with codepage 850, type myFile.txt shows ÔùÅ, and with codepage 1252 (a variant of latin1): â—.
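
The same effect can be reproduced in Python by decoding the three UTF-8 bytes with the wrong codec. This is only a sketch; the variable names are illustrative, and the 'replace' error handler is an assumption needed because Python's cp1252 codec leaves byte 0x8f unassigned and would otherwise raise an error.

```python
raw = b'\xe2\x97\x8f'  # UTF-8 bytes of the bullet, U+25CF

# Correct interpretation: one character.
as_utf8 = raw.decode('utf-8')

# Codepage 850 maps each of the three bytes to its own character.
as_cp850 = raw.decode('cp850')

# Codepage 1252 has no character for 0x8f, so that byte is replaced
# by U+FFFD (the replacement character) instead of raising an error.
as_cp1252 = raw.decode('cp1252', 'replace')

print(len(as_utf8), len(as_cp850), len(as_cp1252))
```

In every case the bytes on disk are identical; only the codec chosen by the reading program changes what is displayed.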

In conclusion original OP code creates a correct utf8 encoded file - it is up to the reading part to interpret correctly utf8.

Serge Ballesta
  • I also fixed what you say about the "unicode representation", which is something that does not exist: you were actually quoting the UTF-8 representation (which is not the only encoding in Unicode). – Eric O. Lebigot Apr 19 '15 at 14:52
  • Finally, it is not exactly true that the code "correctly writes in the UTF8 encoding": it only does if the program editor used is indeed using UTF8 (which is far from always the case, especially on Windows). The general fact about the code in the question is that its output file is in the same encoding as the program editor (the `coding` at the top does not matter). – Eric O. Lebigot Apr 19 '15 at 14:55
  • @EOL : It does write a file that can be read with any editor that can process UTF-8. In that sense, it does write a text file in UTF-8 encoding. And vim (or gvim) works perfectly in Windows (where it is my favorite general text editor). Anyway, thanks for the fixes :-) – Serge Ballesta Apr 19 '15 at 15:03
  • I must disagree: just input the original program in an editor/machine that does not use UTF-8, run the program, and you will see that the output file is not in UTF-8. Even if you don't have easy access to such an editor, you can change the top line into `# coding=latin1` (in a pure text editor that ignores this line): you will see that the output is not in Latin 1: in other words, the top line does not matter for the encoding of the output. The only canonical encoding in the original program is the one used by the program editor. So, I maintain that the program does not generally write UTF-8. – Eric O. Lebigot Apr 20 '15 at 03:59
  • @EOL: The top line `# -*- coding: utf-8 -*-` instructs the Python interpreter (and optionally some text editors, like IDLE's) that the **input** script file is UTF-8 encoded, and it has no direct impact on the output. But see my edit. – Serge Ballesta Apr 20 '15 at 08:52
  • That's my point exactly: the output of the program in the question is indeed not influenced by the `coding` line. Again, the encoding of the output is instead that of the program editor. The fact that the program in the question contains `coding: utf-8` does not mean that it is indeed the encoding used by his program editor (even though it should be, but it could have been put here by a non-UTF-8 editor, only with the hope that it forces a UTF-8 output, which it does not, as we both agree). If you edit the program in a non-UTF-8 compatible editor, you will see that it does not produce UTF-8. – Eric O. Lebigot Apr 20 '15 at 13:10
  • @EOL : if you want to be immune to editor questions, you must use no `coding` line and use escaped representation for non ASCII characters. In the example it would be `outFile.write("\xe2\x97\x8f")` or as answered by Heisenberg and felipsmartins : `outFile.write(u"\u25cf".encode('utf-8'))` – Serge Ballesta Apr 20 '15 at 13:59