Unicode (UTF-8) reading and writing to files in Python

Question

I'm having some brain failure in understanding reading and writing text to a file (Python 2.4).

# The string, which has an a-acute in it.
ss = u'Capit\xe1n'
ss8 = ss.encode('utf8')
repr(ss), repr(ss8)

("u'Capit\xe1n'", "'Capit\xc3\xa1n'")

print ss, ss8
print >> open('f1','w'), ss8

>>> file('f1').read()
'Capit\xc3\xa1n\n'

So I type in Capit\xc3\xa1n into my favorite editor, in file f2.

Then:

>>> open('f1').read()
'Capit\xc3\xa1n\n'
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'
>>> open('f1').read().decode('utf8')
u'Capit\xe1n\n'
>>> open('f2').read().decode('utf8')
u'Capit\\xc3\\xa1n\n'

What am I not understanding here? Clearly there is some vital bit of magic (or good sense) that I'm missing. What does one type into text files to get proper conversions?

What I'm truly failing to grok here, is what the point of the UTF-8 representation is, if you can't actually get Python to recognize it, when it comes from outside. Maybe I should just JSON dump the string, and use that instead, since that has an asciiable representation! More to the point, is there an ASCII representation of this Unicode object that Python will recognize and decode, when coming in from a file? If so, how do I get it?

>>> print simplejson.dumps(ss)
'"Capit\u00e1n"'
>>> print >> file('f3','w'), simplejson.dumps(ss)
>>> simplejson.load(open('f3'))
u'Capit\xe1n'

The important thing to understand is that `u'Capit\xe1n\n'` **is a correct result**, and that string **already does contain** the special character you are looking for. It is only being **represented with** an escape sequence. The underlying question here is **not actually anything to do with** how to read or write files and specify an encoding, because the code **already shows how to do that correctly**. — Karl Knechtel, Aug 29 '23 at 19:28
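
A minimal sketch of that point, written for Python 3 (an assumption; the question uses Python 2.4, where the literal needs a `u` prefix). The escape sequence only appears in the escaped representation; the string itself holds one character at that position. The variable names are illustrative.

```python
# Python 3 sketch: \xe1 in a string literal denotes ONE character (a-acute),
# not the four characters backslash, x, e, 1.
ss = 'Capit\xe1n'

length = len(ss)           # counts characters in the string
escaped_view = ascii(ss)   # repr-style text with non-ASCII characters escaped

print(length)
print(escaped_view)
```

The escaped form is what the interactive prompt shows in Python 2; it is a display convention, not the content of the string.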

score 871 · Answer 1 · edited Sep 03 '22 at 22:03

Rather than mess with .encode and .decode, specify the encoding when opening the file. The io module, added in Python 2.6, provides an io.open function, which allows specifying the file's encoding.

Supposing the file is encoded in UTF-8, we can use:

>>> import io
>>> f = io.open("test", mode="r", encoding="utf-8")

Then f.read returns a decoded Unicode object:

>>> f.read()
u'Capit\xe1l\n\n'

In 3.x, the io.open function is an alias for the built-in open function, which supports the encoding argument (it does not in 2.x).

We can also use open from the codecs standard library module:

>>> import codecs
>>> f = codecs.open("test", "r", "utf-8")
>>> f.read()
u'Capit\xe1l\n\n'

Note, however, that this can cause problems when mixing read() and readline().

edited Sep 03 '22 at 22:03 by Karl Knechtel · answered May 10 '09 at 00:45 by Tim Swast

Works perfectly for writing files too, instead of `open(file,'w')` do `codecs.open(file,'w','utf-8')` solved – Matt Connolly Mar 04 '11 at 02:12
Does the `codecs.open(...)` method also fully conform to the `with open(...):` style, where the `with` cares about closing the file after all is done? It seems to work anyway. – try-catch-finally Mar 04 '13 at 18:09
@try-catch-finally Yes. I use `with codecs.open(...) as f:` all the time. – Tim Swast Jul 08 '13 at 14:27
I wish I could upvote this a hundred times. After agonizing for several days over encoding issues caused by a lot of mixed data and going cross-eyed reading about encoding, this answer is like water in a desert. Wish I'd seen it sooner. – Mike Girard Jul 21 '13 at 18:24
Great catch! I was trying to clean up code downstream; I went straight to the source of the problem with `io.open(filename,'r',encoding='utf-8') as file:` – Pat Grady Feb 08 '18 at 14:32
Use `encoding="utf-8-sig"` if there's any chance your file will have a BOM (works in Python 2.7) – Perry Oct 30 '19 at 23:22
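
To illustrate the BOM comment above, here is a small Python 3 sketch (an assumption; the thread is mostly about 2.x). It writes a file that starts with a UTF-8 byte order mark and reads it back with both codecs. The temporary path and variable names are made up for the example.

```python
import codecs
import os
import tempfile

# Hypothetical file location for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'bom.txt')

# Write a UTF-8 BOM followed by UTF-8 encoded text, as some editors do.
with open(path, 'wb') as f:
    f.write(codecs.BOM_UTF8 + 'Capit\xe1n\n'.encode('utf-8'))

# Plain 'utf-8' keeps the BOM as a leading U+FEFF character.
with open(path, encoding='utf-8') as f:
    plain = f.read()

# 'utf-8-sig' drops the BOM if present, and also works when it is absent.
with open(path, encoding='utf-8-sig') as f:
    stripped = f.read()

print(ascii(plain))
print(ascii(stripped))
```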

score 122 · Accepted Answer · edited Sep 03 '22 at 22:29

In the notation u'Capit\xe1n\n' (should be just 'Capit\xe1n\n' in 3.x, and must be in 3.0 and 3.1), the \xe1 represents just one character. \x is an escape sequence, indicating that e1 is in hexadecimal.

Writing Capit\xc3\xa1n into the file in a text editor means that it actually contains \xc3\xa1. Those are 8 bytes and the code reads them all. We can see this by displaying the result:

# Python 3.x - reading the file as bytes rather than text,
# to ensure we see the raw data
>>> open('f2', 'rb').read()
b'Capit\\xc3\\xa1n\n'

# Python 2.x
>>> open('f2').read()
'Capit\\xc3\\xa1n\n'

Instead, just input characters like á in the editor, which should then handle the conversion to UTF-8 and save it.
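
As a rough Python 3 illustration of the difference (the names here are only for the example): the escape text typed into the editor is eight ASCII bytes, while the character á saved as UTF-8 is two bytes.

```python
# What was typed into the editor: literal backslashes, letters and digits.
typed_accent = r'\xc3\xa1'.encode('ascii')

# What an editor saving as UTF-8 writes for the single character a-acute.
real_accent = '\xe1'.encode('utf-8')

print(len(typed_accent), typed_accent)
print(len(real_accent), real_accent)
```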

In 2.x, a string that actually contains these backslash-escape sequences can be decoded using the string_escape codec:

# Python 2.x
>>> print 'Capit\\xc3\\xa1n\n'.decode('string_escape')
Capitán

The result is a str that is encoded in UTF-8 where the accented character is represented by the two bytes that were written \\xc3\\xa1 in the original string. To get a unicode result, decode again with UTF-8.

In 3.x, the string_escape codec is replaced with unicode_escape, and it is strictly enforced that we can only encode from a str to bytes, and decode from bytes to str. unicode_escape needs to start with a bytes in order to process the escape sequences (the other way around, it adds them); and then it will treat the resulting \xc3 and \xa1 as character escapes rather than byte escapes. As a result, we have to do a bit more work:

# Python 3.x
>>> 'Capit\\xc3\\xa1n\n'.encode('ascii').decode('unicode_escape').encode('latin-1').decode('utf-8')
'Capitán\n'

edited Sep 03 '22 at 22:29 by Karl Knechtel · answered Jan 29 '09 at 15:11

So, what's the point of the utf-8 encoded format if python can read in files using it? In other words, is there any ascii representation that python will read in \xc3 as 1 byte? – Gregg Lind Jan 29 '09 at 16:51
The answer to your "So, what's the point…" question is "Mu." (since Python can read files encoded in UTF-8). For your second question: \xc3 is not part of the ASCII set. Perhaps you mean "8-bit encoding" instead. You are confused about Unicode and encodings; it's ok, many are. – tzot Jan 30 '09 at 12:16
Try reading this as a primer: http://www.joelonsoftware.com/articles/Unicode.html – tzot Jan 30 '09 at 12:16
note: `u'\xe1'` is one Unicode codepoint [`U+00e1`](http://codepoints.net/U+00e1) that can be represented using 1 or *more* bytes depending on character encoding (it is 2 bytes in utf-8). `b'\xe1'` is one byte (a number 225), what letter if any it can represent depends on character encoding used to decode it e.g., it is [`б` (`U+0431`)](http://codepoints.net/U+0431) in cp1251, [`с` (`U+0441`)](http://codepoints.net/U+0441) in cp866, etc. – jfs Jun 15 '13 at 06:31
It is amazing how many British coders say "just use ascii" and then fail to realise that the £ sign is not it. Most are not aware that ascii!=local code page (ie latin1). – Danny Staple Sep 05 '13 at 12:58
To your last point I get an error message writing bytes to a file. `write() argument must be str, not bytes` both in python2 and python3.7 – vi_ral Apr 18 '21 at 00:51
@vi_ral yes, in 3.x, `.encode`ing a string produces a `bytes`, which can only be written to a file that was open in binary mode. It should work in 2.x, because 2.x's incorrect, buggy handling of text pretends that `bytes` constitute a string (in 2.7 - I'm not sure how far back it goes - `str` is actually aliased to `bytes`, while `unicode` is the name for the *actual* string type). – Karl Knechtel Sep 03 '22 at 21:55
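
The error discussed in the last two comments can be reproduced with a short Python 3 sketch like the one below. The temporary path and variable names are assumptions for illustration; the point is only that bytes go to a file opened in binary mode, and str goes to a file opened in text mode.

```python
import os
import tempfile

# Hypothetical file location for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'f1')
data = 'Capit\xe1n\n'.encode('utf-8')   # a bytes object

# Writing bytes to a text-mode file fails in Python 3.
try:
    with open(path, 'w', encoding='utf-8') as f:
        f.write(data)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = str(exc)

# Writing the same bytes to a binary-mode file works.
with open(path, 'wb') as f:
    f.write(data)

with open(path, 'rb') as f:
    round_trip = f.read()

print(text_mode_error)
print(round_trip)
```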

score 82 · Answer 3 · edited Mar 04 '18 at 08:20

Now all you need in Python3 is open(Filename, 'r', encoding='utf-8')

[Edit on 2016-02-10 for requested clarification]

Python3 added the encoding parameter to its open function. The following information about the open function is gathered from here: https://docs.python.org/3/library/functions.html#open

open(file, mode='r', buffering=-1, 
      encoding=None, errors=None, newline=None, 
      closefd=True, opener=None)

Encoding is the name of the encoding used to decode or encode the file. This should only be used in text mode. The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

So by adding encoding='utf-8' as a parameter to the open function, the file reading and writing is all done as UTF-8 (which is also the default source-code encoding and the default for str.encode() and bytes.decode() in Python 3).
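
A small sketch of what that looks like in practice, assuming Python 3 and a throwaway temporary file (the path and names are made up). It also prints the platform default mentioned in the documentation excerpt, which may or may not be UTF-8 on a given machine.

```python
import locale
import os
import tempfile

# Hypothetical file location for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

# The platform-dependent default; this value varies between systems.
default_encoding = locale.getpreferredencoding(False)

# Text goes in as str; open() encodes it with the codec we name.
with open(path, 'w', encoding='utf-8') as f:
    f.write('Capit\xe1n')

# Reading in binary mode shows the bytes that actually landed on disk.
with open(path, 'rb') as f:
    raw = f.read()

# Reading in text mode with the same codec gives the original str back.
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()

print(default_encoding)
print(raw)
print(ascii(text))
```

Passing the encoding explicitly keeps the result independent of whatever the default happens to be.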

Could you please elaborate more your answer adding a little more description about the solution you provide? — abarisone, Feb 10 '16 at 16:26
It looks this is available in python 2 using the codecs module - `codecs.open('somefile', encoding='utf-8')` http://stackoverflow.com/a/147756/149428 — Taylor D. Edmiston, Aug 14 '16 at 01:43

score 18 · Answer 4 · edited Sep 03 '22 at 21:56

This works for reading a file with UTF-8 encoding in Python 3.2:

import codecs
f = codecs.open('file_name.txt', 'r', 'UTF-8')
for line in f:
    print(line)

edited Sep 03 '22 at 21:56 by Karl Knechtel · answered Aug 19 '14 at 08:09 by Sina

score 18 · Answer 5 · edited Jan 04 '17 at 18:37

So, I've found a solution for what I'm looking for, which is:

print open('f2').read().decode('string-escape').decode("utf-8")

There are some unusual codecs that are useful here. This particular reading allows one to take UTF-8 representations from within Python, copy them into an ASCII file, and have them be read in to Unicode. Under the "string-escape" decode, the slashes won't be doubled.

This allows for the sort of round trip that I was imagining.

score 14 · Answer 6 · answered Feb 08 '12 at 20:24

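# -*- coding: utf-8 -*-

# Convert a file of unknown encoding to UTF-8.
# Relies on the external `file` utility to guess the source encoding.

import codecs
import subprocess

file_location = "jumper.sub"

# Pass the arguments as a list so that paths containing spaces are safe.
file_encoding = subprocess.check_output(
    ['file', '-b', '--mime-encoding', file_location]).decode('ascii').strip()

# codecs.open raises LookupError if `file` reports a name Python has no codec for.
with codecs.open(file_location, 'r', file_encoding) as file_stream:
    with codecs.open(file_location + "b", 'w', 'utf-8') as file_output:
        for line in file_stream:
            file_output.write(line)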

score 9 · Answer 7 · edited Sep 03 '22 at 22:33

Aside from codecs.open(), io.open() can be used in both 2.x and 3.x to read and write text files. Example:

import io

text = u'á'
encoding = 'utf8'

with io.open('data.txt', 'w', encoding=encoding, newline='\n') as fout:
    fout.write(text)

with io.open('data.txt', 'r', encoding=encoding, newline='\n') as fin:
    text2 = fin.read()

assert text == text2

edited Sep 03 '22 at 22:33 by Karl Knechtel · answered Jun 21 '17 at 09:37 by Ryan

+1 [io is much better than codecs.](https://stackoverflow.com/questions/46437761/python2-7-codecs-openutf-8-fails-to-read-plain-ascii-file) – personal_cloud Sep 27 '17 at 20:58
Yes, using io is better; But I wrote the with statement like this `with io.open('data.txt', 'w', 'utf-8') as file:` and got an error: `TypeError: an integer is required`. After I changed to `with io.open('data.txt', 'w', encoding='utf-8') as file:` and it worked. – Evan Hu Jan 02 '18 at 05:33

score 6 · Answer 8 · answered Sep 18 '14 at 14:38

To read in a Unicode string and then send it to HTML, I did this:

fileline.decode("utf-8").encode('ascii', 'xmlcharrefreplace')

Useful for python powered http servers.
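
For reference, a Python 3 version of the same idea might look like the sketch below. It assumes the line was read as UTF-8 bytes; `line_bytes` and `html_safe` are illustrative names, not part of the original answer.

```python
# A line as it might come from a file opened in binary mode (UTF-8 bytes).
line_bytes = b'Capit\xc3\xa1n'

# Decode to text, then encode to ASCII, replacing anything outside ASCII
# with a numeric character reference that HTML understands.
html_safe = line_bytes.decode('utf-8').encode('ascii', 'xmlcharrefreplace')

print(html_safe)
```

The result is pure ASCII, so it survives regardless of the charset the page is eventually served with.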

answered Sep 18 '14 at 14:38 by praj

score 6 · Answer 9 · edited Jan 04 '17 at 18:11

Well, your favorite text editor does not realize that \xc3\xa1 are supposed to be character literals, but it interprets them as text. That's why you get the double backslashes in the last line -- it's now a real backslash + xc3, etc. in your file.

If you want to read and write encoded files in Python, best use the codecs module.

Pasting text between the terminal and applications is difficult, because you don't know which program will interpret your text using which encoding. You could try the following:

>>> s = file("f1").read()
>>> print unicode(s, "Latin-1")
CapitÃ¡n

Then paste this string into your editor and make sure that it stores it using Latin-1. Under the assumption that the clipboard does not garble the string, the round trip should work.
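
The Latin-1 round trip described above can be sketched in Python 3 as follows (an assumption; the snippet above is Python 2). The variable names are illustrative. Decoding UTF-8 bytes as Latin-1 produces the garbled text, and because Latin-1 maps every byte to exactly one character, the step can be reversed.

```python
# The UTF-8 bytes that are stored in f1 (without the trailing newline).
utf8_bytes = 'Capit\xe1n'.encode('utf-8')

# Interpreting those bytes as Latin-1 yields two characters for the accent.
mojibake = utf8_bytes.decode('latin-1')

# Encoding back to Latin-1 restores the original bytes, which then
# decode correctly as UTF-8.
restored = mojibake.encode('latin-1').decode('utf-8')

print(ascii(mojibake))
print(ascii(restored))
```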

score 6 · Answer 10 · edited May 23 '17 at 11:47

You have stumbled over the general problem with encodings: How can I tell which encoding a file is in?

Answer: You can't unless the file format provides for this. XML, for example, begins with:

<?xml version="1.0" encoding="utf-8"?>

This header was carefully chosen so that it can be read no matter the encoding. In your case, there is no such hint, hence neither your editor nor Python has any idea what is going on. Therefore, you must use the codecs module and use codecs.open(path,mode,encoding) which provides the missing bit in Python.

As for your editor, you must check if it offers some way to set the encoding of a file.

The point of UTF-8 is to be able to encode 21-bit characters (Unicode) as an 8-bit data stream (because that's the only thing all computers in the world can handle). But since most OSs predate the Unicode era, they don't have suitable tools to attach the encoding information to files on the hard disk.
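
As a quick illustration of that encoding step, the Python 3 sketch below (sample code points chosen arbitrarily for the example) shows how many bytes UTF-8 uses for a few characters. ASCII characters stay at one byte, which is why plain ASCII files are already valid UTF-8.

```python
# A few sample code points, written as escapes so the source stays ASCII.
samples = {
    'U+0041': '\u0041',       # LATIN CAPITAL LETTER A
    'U+00E1': '\u00e1',       # LATIN SMALL LETTER A WITH ACUTE
    'U+20AC': '\u20ac',       # EURO SIGN
    'U+1F600': '\U0001F600',  # GRINNING FACE
}

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = {name: len(ch.encode('utf-8')) for name, ch in samples.items()}

for name, ch in samples.items():
    print(name, ch.encode('utf-8'))
```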

The next issue is the representation in Python. This is explained perfectly in the comment by heikogerlach. You must understand that the interpreter shows you the repr of a string, which in Python 2 uses only ASCII. In order to display anything with a character code >= 128, it must use some means of escaping. In your editor, you must not type the escaped display string but what the string means (in this case, you must enter the accented character á and save the file).

That said, you can use the Python function eval() to turn an escaped string into a string:

>>> x = eval("'Capit\\xc3\\xa1n\\n'")
>>> x
'Capit\xc3\xa1n\n'
>>> x[5]
'\xc3'
>>> len(x[5])
1

As you can see, the string "\xc3" has been turned into a single character. This is now an 8-bit string, UTF-8 encoded. To get Unicode:

>>> x.decode('utf-8')
u'Capit\xe1n\n'
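
Since eval() will run arbitrary code, a more cautious variant of the same trick is ast.literal_eval, which only accepts literals. The sketch below is for Python 3 and assumes the escaped text contains no quote characters and does not end in a backslash; `escaped_text` is a made-up name standing in for the contents of f2.

```python
import ast

# Literal backslashes, as read from a file such as f2.
escaped_text = r'Capit\xc3\xa1n\n'

# Wrap the text in a bytes literal and let literal_eval process the escapes.
# Assumption: escaped_text holds no quotes and no trailing backslash.
raw = ast.literal_eval("b'" + escaped_text + "'")

# The escapes are now real bytes, which can be decoded as UTF-8.
text = raw.decode('utf-8')

print(raw)
print(ascii(text))
```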

Gregg Lind asked: I think there are some pieces missing here: the file f2 contains: hex:

0000000: 4361 7069 745c 7863 335c 7861 316e  Capit\xc3\xa1n

codecs.open('f2','rb', 'utf-8'), for example, reads them all in a separate chars (expected) Is there any way to write to a file in ASCII that would work?

Answer: That depends on what you mean. ASCII can't represent characters > 127. So you need some way to say "the next few characters mean something special" which is what the sequence "\x" does. It says: The next two characters are the code of a single character. "\u" does the same using four characters to encode Unicode up to 0xFFFF (65535).

So you can't directly write Unicode to ASCII (because ASCII simply doesn't contain the same characters). You can write it as string escapes (as in f2); in this case, the file can be represented as ASCII. Or you can write it as UTF-8, in which case, you need an 8-bit safe stream.
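
The three options in that paragraph can be sketched in Python 3 like this (illustrative names; the thread itself is about 2.x). Encoding straight to ASCII fails, the unicode_escape codec gives an ASCII-only form that can be decoded again, and UTF-8 gives bytes above 127.

```python
text = 'Capit\xe1n'

# 1. ASCII has no a-acute, so a direct encode fails.
try:
    text.encode('ascii')
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# 2. Escaped form: pure ASCII bytes that Python can turn back into text.
escaped = text.encode('unicode_escape')
back = escaped.decode('unicode_escape')

# 3. UTF-8: needs an 8-bit safe stream, since it uses bytes above 127.
utf8 = text.encode('utf-8')

print(escaped)
print(utf8)
```

This escaped form is close to what the question calls an "asciiable representation", much like the JSON output shown there.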

Your solution using decode('string-escape') does work, but you must be aware of how much memory it uses: three times as much as codecs.open().

Remember that a file is just a sequence of bytes with 8 bits. Neither the bits nor the bytes have a meaning. It's you who says "65 means 'A'". Since \xc3\xa1 should become "á" but the computer has no means to know, you must tell it by specifying the encoding which was used when writing the file.

I think there are some pieces missing here: the file f2 contains: hex: 0000000: 4361 7069 745c 7863 335c 7861 316e 0a Capit\xc3\xa1n. codecs.open('f2','rb', 'utf-8') , for example, reads them all in a separate chars (expected) Is there any way to write to a file in ascii that would work? — Gregg Lind, Jan 29 '09 at 17:21

score 4 · Answer 11 · edited Jan 04 '17 at 18:09

The \x.. sequence is something that's specific to Python. It's not a universal byte escape sequence.

How you actually enter in UTF-8-encoded non-ASCII depends on your OS and/or your editor. Here's how you do it in Windows. For OS X to enter a with an acute accent you can just hit option + E, then A, and almost all text editors in OS X support UTF-8.

edited Jan 04 '17 at 18:09 by Peter Mortensen · answered Jan 29 '09 at 15:10 by ʞɔıu

score 3 · Answer 12 · edited Jan 04 '17 at 18:47

You can also improve the original open() function to work with Unicode files by replacing it in place, using the partial function. The beauty of this solution is you don't need to change any old code. It's transparent.

import codecs
import functools
open = functools.partial(codecs.open, encoding='utf-8')
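
In Python 3 the same idea could be applied to the built-in open directly. The sketch below is only one way to do it; it uses a separate name, `open_utf8` (hypothetical), instead of shadowing open, and a temporary file for the demonstration.

```python
import functools
import os
import tempfile

# A text-mode opener that defaults to UTF-8; the name is illustrative.
open_utf8 = functools.partial(open, encoding='utf-8')

# Hypothetical file location for the demonstration.
path = os.path.join(tempfile.mkdtemp(), 'data.txt')

with open_utf8(path, 'w') as f:
    f.write('Capit\xe1n')

with open_utf8(path) as f:
    text = f.read()

# Check the bytes on disk with the ordinary binary-mode open.
with open(path, 'rb') as f:
    raw = f.read()

print(ascii(text))
print(raw)
```

One caveat with this approach: a binary mode such as 'rb' cannot take an encoding argument, so calls that open files in binary mode would fail if the built-in name itself were replaced.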

edited Jan 04 '17 at 18:47 by Peter Mortensen · answered Dec 08 '16 at 03:22 by hipertracker

score 1 · Answer 13 · edited Jan 04 '17 at 18:45

I was trying to parse iCal using Python 2.7.9:

from icalendar import Calendar

But I was getting:

Traceback (most recent call last):
  File "ical.py", line 92, in parse
    print "{}".format(e[attr])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 7: ordinal not in range(128)

and it was fixed with just:

print "{}".format(e[attr].encode("utf-8"))

(Now it can print liké á böss.)

score -2 · Answer 14 · answered Dec 17 '19 at 14:49

I found the simplest approach was to change the default encoding of the whole script to UTF-8:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

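After this, any implicit conversion from unicode to str, such as writing a unicode object to a file opened with open or printing to a pipe, uses UTF-8 instead of ASCII.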

Works at least for Python 2.7.9.

Thx goes to https://markhneedham.com/blog/2015/05/21/python-unicodeencodeerror-ascii-codec-cant-encode-character-uxfc-in-position-11-ordinal-not-in-range128/ (look at the end).
