119

In a text file, there is a string "I don't like this".

However, when I read it into a string, it becomes "I don\xe2\x80\x98t like this". I understand that \u2018 is the unicode representation of "'". I use

f1 = open (file1, "r")
text = f1.read()

command to do the reading.

Now, is it possible to read the file in such a way that the resulting string is "I don't like this", instead of "I don\xe2\x80\x98t like this"?

Second edit: I have seen some people use mapping to solve this problem, but really, is there no built-in way to do this kind of ANSI to unicode (and vice versa) conversion?

Dzinx
Graviton
  • btw, your text file is broken! U+2018 is the "LEFT SINGLE QUOTATION MARK", not an apostrophe (U+0027 most commonly). –  Sep 30 '08 at 19:51
  • the thing is, you need to convert UNICODE to ASCII (not the other way around). – hasen Dec 08 '08 at 12:21
  • Some comments: I have seen some people use mapping to solve this problem, but really, is there no built-in conversion that does this kind of ANSI to unicode ( and vice versa) conversion? Thanks! – Graviton Sep 29 '08 at 07:11
  • There's not, because there are hundreds of thousands of Unicode code points. How would you decide which should be mapped to what ASCII characters? – John Millikin Sep 29 '08 at 07:25
  • john, your comment is wrong, at least in the general sense. the iconv lib can be used to transliterate unicode characters to ascii (even locale dependent). $ python -c 'print u"\u2018".encode("utf-8")' | iconv -t 'ascii//translit' | xxd 0000000: 270a –  Sep 30 '08 at 19:59

9 Answers

189

Ref: http://docs.python.org/howto/unicode

Reading Unicode from a file is therefore simple:

import codecs
with codecs.open('unicode.rst', encoding='utf-8') as f:
    for line in f:
        print repr(line)

It's also possible to open files in update mode, allowing both reading and writing:

with codecs.open('test', encoding='utf-8', mode='w+') as f:
    f.write(u'\u4500 blah blah blah\n')
    f.seek(0)
    print repr(f.readline()[:1])

EDIT: I'm assuming that your intended goal is just to be able to read the file properly into a string in Python. If you're trying to convert to an ASCII string from Unicode, then there's really no direct way to do so, since the Unicode characters won't necessarily exist in ASCII.

If you're trying to convert to an ASCII string, try one of the following:

  1. Replace the specific unicode chars with ASCII equivalents, if you are only looking to handle a few special cases such as this particular example

  2. Use the unicodedata module's normalize() and the string.encode() method to convert as best you can to the next closest ASCII equivalent (Ref https://web.archive.org/web/20090228203858/http://techxplorer.com/2006/07/18/converting-unicode-to-ascii-using-python):

    >>> teststr
    u'I don\xe2\x80\x98t like this'
    >>> unicodedata.normalize('NFKD', teststr).encode('ascii', 'ignore')
    'I donat like this'
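
The 'I donat like this' result above is what falls out when the UTF-8 bytes were decoded with the wrong codec before normalizing. Here is a minimal Python 3 sketch of the two cases. The helper name to_ascii is hypothetical, and the latin-1 step is only an assumption about how the string came to be mis-decoded:

```python
import unicodedata

def to_ascii(s):
    """Best-effort ASCII: decompose, then drop whatever ASCII cannot represent."""
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

raw = b"I don\xe2\x80\x98t like this"   # the bytes as stored in the file

decoded = raw.decode("utf-8")        # correct codec: a single character, U+2018
mis_decoded = raw.decode("latin-1")  # wrong codec (assumed): three separate characters

print(to_ascii(decoded))       # I dont like this
print(to_ascii(mis_decoded))   # I donat like this
```

Neither result contains an apostrophe: U+2018 has no decomposition, so NFKD cannot turn it into U+0027, and an explicit replacement is still needed if that is the goal.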
    
Gulzar
Jay
  • `codecs` module doesn't handle universal newlines mode properly. Use `io.open()` instead on Python 2.7+ (it is builtin `open()` on Python 3). – jfs Jun 05 '15 at 20:25
22

It is also possible to read an encoded text file in Python 3 by passing the encoding to the built-in open():

f = open('file.txt', 'r', encoding='utf-8')
text = f.read()
f.close()

With this variation, there is no need to import any additional libraries.
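
The same read can be written with a with block so the file is closed even if reading fails. This is a sketch, assuming Python 3; the temporary file only stands in for the real text file:

```python
import os
import tempfile

# Create a small UTF-8 file to read back (a stand-in for the real file).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8") as f:
    f.write("I don\u2018t like this\n")

# Passing encoding= makes read() return decoded text instead of raw bytes.
with open(path, "r", encoding="utf-8") as f:
    text = f.read()

os.remove(path)
print(repr(text))
```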

Stein
15

There are a few points to consider.

The sequence \u2018 appears only in the repr() of a unicode string in Python, e.g. if you write:

>>> text = u'‘'
>>> print repr(text)
u'\u2018'

Now if you simply want to print the unicode string prettily, just use unicode's encode method:

>>> text = u'I don\u2018t like this'
>>> print text.encode('utf-8')
I don‘t like this

To make sure that every line from any file is read as unicode, use the codecs.open function instead of plain open; it lets you specify the file's encoding:

>>> import codecs
>>> f1 = codecs.open(file1, "r", "utf-8")
>>> text = f1.read()
>>> print type(text)
<type 'unicode'>
>>> print text.encode('utf-8')
I don‘t like this
Dzinx
6

But it really is "I don\u2018t like this" and not "I don't like this". The character u'\u2018' is a completely different character than "'" (and, visually, should correspond more to '`').

If you're trying to convert encoded unicode into plain ASCII, you could perhaps keep a mapping of unicode punctuation that you would like to translate into ASCII.

punctuation = {
  u'\u2018': "'",
  u'\u2019': "'",
}
for src, dest in punctuation.iteritems():
  text = text.replace(src, dest)

There are an awful lot of punctuation characters in unicode, but I suppose you can count on only a few of them actually being used by whatever application is creating the documents you're reading.

Logan
  • actually, if you make the dict map Unicode ordinals to Unicode ordinals ({0x2018: 0x27, 0x2019: 0x27}) you can just pass the entire dict to text.translate() to do all the replacing in one go. – Thomas Wouters Sep 29 '08 at 09:35
3

There is a possibility that you somehow have a non-unicode string containing literal unicode escape sequences, e.g.:

>>> print repr(text)
'I don\\u2018t like this'

This actually happened to me once before. You can use a unicode_escape codec to decode the string to unicode and then encode it to any format you want:

>>> uni = text.decode('unicode_escape')
>>> print type(uni)
<type 'unicode'>
>>> print uni.encode('utf-8')
I don‘t like this
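
In Python 3, str has no decode method, so the same idea needs a detour through bytes. A minimal sketch, assuming the string contains only ASCII characters plus the literal escape sequence:

```python
text = "I don\\u2018t like this"   # backslash, 'u', '2', '0', '1', '8' as literal characters

# encode("ascii") raises if the text already holds non-ASCII characters,
# which guards against the codec misreading them as Latin-1.
uni = text.encode("ascii").decode("unicode_escape")

print(uni == "I don\u2018t like this")  # True
print(len(text), len(uni))              # 22 17
```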
Dzinx
3

Leaving aside the fact that your text file is broken (U+2018 is a left quotation mark, not an apostrophe): iconv can be used to transliterate unicode characters to ascii.

You'll have to google for "iconvcodec", since the module seems not to be supported anymore and I can't find a canonical home page for it.

>>> import iconvcodec
>>> from locale import setlocale, LC_ALL
>>> setlocale(LC_ALL, '')
>>> u'\u2018'.encode('ascii//translit')
"'"

Alternatively you can use the iconv command line utility to clean up your file:

$ xxd foo
0000000: e280 980a                                ....
$ iconv -t 'ascii//translit' foo | xxd
0000000: 270a                                     '.
1

Actually, U+2018 is the Unicode representation of the special character ‘. If you want, you can convert instances of that character to U+0027 with this code:

text = text.replace(u"\u2018", "'")

In addition, what are you using to write the file? f1.read() should return a string that looks like this:

'I don\xe2\x80\x98t like this'

If it's returning this string, the file is being written incorrectly:

'I don\u2018t like this'
John Millikin
  • Sorry! As you said, it is returning 'I don\xe2\x80\x98t like this' – Graviton Sep 29 '08 at 06:59
  • The 'I don\xe2\x80\x98t like this' that you're seeing is what Python would call a str. It appears to be the utf-8 encoding of u'I don\u2018t like this', which is a unicode instance in Python. Try calling .decode('utf-8') on the former or .encode('utf-8') on the latter. – Logan Sep 29 '08 at 07:11
  • @hop: oops, forgot ord() returns decimal instead of hex. Thank you for the catch. – John Millikin Oct 01 '08 at 01:03
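
The decode/encode round trip that Logan's comment describes can be sketched in Python 3, where the two types are called bytes and str rather than str and unicode:

```python
raw = b"I don\xe2\x80\x98t like this"   # what a binary read of the file returns

text = raw.decode("utf-8")               # bytes -> text
print(len(raw), len(text))               # 19 17: three bytes became one character

# Encoding the text again gives back exactly the original bytes.
round_trip = text.encode("utf-8")
print(round_trip == raw)                 # True
```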
1

This is Python's way of showing you unicode strings. But I think you should be able to print the string on the screen or write it into a new file without any problems.

>>> test = u"I don\u2018t like this"
>>> test
u'I don\u2018t like this'
>>> print test
I don‘t like this
xardias
1

I'm not sure about the errors="ignore" option (it silently drops any bytes that are not valid UTF-8), but it seems to work for files with strange Unicode characters.

with open(fName, "rb") as fData:
    lines = fData.read().splitlines()
    lines = [line.decode("utf-8", errors="ignore") for line in lines]
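
To see what errors="ignore" actually does, it helps to compare it with the other standard error handlers. This is a small Python 3 sketch; the sample bytes are made up for illustration:

```python
data = b"caf\xc3\xa9 \xff ok"   # valid UTF-8 for 'café', then a stray 0xFF byte

ignored = data.decode("utf-8", errors="ignore")    # invalid byte silently dropped
replaced = data.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD

try:
    data.decode("utf-8")          # default errors="strict"
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(ignored))    # 'café  ok'
print(repr(replaced))   # 'café \ufffd ok'
print(strict_failed)    # True
```

With "ignore", damaged input loses data without any warning, so "replace" or the default strict mode may be safer when the file's encoding is uncertain.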
nvd