28

I don't know whether this is trivial or not, but I need to convert a Unicode string to an ASCII string, and I don't want all those escape characters in the result. Is it possible to do an "approximate" conversion, mapping each character to a similar-looking ASCII character?

For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be just converted to Gavin O'Connor. Is this possible? Did anyone write some util to do it, or do I have to manually replace all chars?

Thank you very much! Marco

Marco Moschettini
  • 3
    see this [http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database] – Facundo Casco Nov 10 '11 at 22:52
  • 1
    What you are trying to achieve is not something desirable. You may end up having to add new replacements all the time. It would be really nice if you could explain why this is needed and why you must use ASCII instead of Unicode. – sorin Nov 10 '11 at 22:53
  • @sorin: Not if you use a utility that already has replacements for all Unicode characters. – Petr Viktorin Nov 11 '11 at 08:59

5 Answers

38

Use the Unidecode package to transliterate the string.

>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"
Petr Viktorin
  • 1
    Just installed it, but after `import unidecode`, `unidecode.unidecode(u'Gavin O’Connor')` gives me `"Gavin OConnor"`. – Marco Moschettini Nov 10 '11 at 23:30
  • 1
    It means that `’` is a Unicode character, and there is no ASCII equivalent. `’` is not `'`, at least according to Python. You may want to make a dictionary of special characters like these, mapping each one to a similar-looking ASCII character. Then you can just replace the Unicode characters with the corresponding ASCII ones. – D K Nov 11 '11 at 01:43
11
import unicodedata

unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')

Output:

Gavin O'Connor

Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
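
To see why results can differ, it may help to inspect what NFKD does to each character. The following is only a sketch for Python 3, and the helper name `show_decomposition` is made up for illustration. As far as the Unicode data shipped with Python goes, U+2019 (the right single quotation mark) has no decomposition, so NFKD leaves it unchanged and `encode('ascii', 'ignore')` then drops it, whereas an accented letter splits into a base letter plus a combining mark.

```python
import unicodedata

def show_decomposition(text):
    """Return (char, NFKD code points) pairs for the non-ASCII characters in text.

    Hypothetical helper, for illustration only.
    """
    result = []
    for ch in text:
        if ord(ch) > 127:
            decomposed = unicodedata.normalize('NFKD', ch)
            result.append((ch, ['U+%04X' % ord(c) for c in decomposed]))
    return result

# The curly quote maps to itself: NFKD has nothing to decompose it into.
print(show_decomposition('Gavin O\u2019Connor'))

# An accented letter becomes a base letter followed by a combining accent.
print(show_decomposition('caf\u00e9'))

# So normalizing and then ignoring non-ASCII characters drops the quote.
ascii_bytes = unicodedata.normalize('NFKD', 'Gavin O\u2019Connor').encode('ascii', 'ignore')
print(ascii_bytes.decode('ascii'))
```

If the quote shows up as `U+2019` in the first line of output, normalization alone will not turn it into `'`, and an explicit replacement or a transliteration library is needed for that character.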

Acorn
  • 1
    That's just going to remove the apostrophe from the example input string. OP was looking for a way to replace it with the "close enough" ascii single quote character. – slowdog Nov 10 '11 at 22:53
  • Hmm, on my machine it gives the above output, but when attempting the same thing elsewhere the apostrophe is just removed.. odd. – Acorn Nov 10 '11 at 23:03
  • 1
    With my python 2.6.6, `unicodedata.normalize('NFKD', u'Gavin O\u2019Connor') == u'Gavin O\u2019Connor'`, and `u'Gavin O\u2019Connor'.encode('ascii', 'ignore') == 'Gavin OConnor'`. I am beyond baffled by the standard you linked to, so I can't tell if that's a bug of `unicodedata.normalize`, or correct behaviour. – slowdog Nov 10 '11 at 23:15
  • In 2.6.5 `unicodedata.normalize('NFKD', u"Gavin O’Connor").encode('ascii','ignore')` gives me `"Gavin O'Connor"` – Acorn Nov 10 '11 at 23:46
8
b = str(a.encode('utf-8').decode('ascii', 'ignore'))

should work fine.

D K
2

There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm
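
As a rough illustration of the accent-stripping part (a sketch for Python 3, not the code from the linked article; the function name `strip_accents` is an assumption), one can decompose the string with NFKD and drop the combining marks. Characters that have no decomposition, such as the curly apostrophe, pass through unchanged and still need a direct replacement.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after NFKD decomposition.

    Characters without a decomposition are returned unchanged.
    """
    decomposed = unicodedata.normalize('NFKD', text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('Se\u00f1or M\u00fcller'))   # accented letters lose their accents
print(strip_accents('Gavin O\u2019Connor'))      # the curly apostrophe is left as-is
```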

Mark Tolonen
-2

Try simple character replacement:

str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))

PS: add # -*- coding: utf-8 -*- to the top of your .py file if you get an encoding error.
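
If the list of replacements grows, chaining `.replace()` calls gets unwieldy. One possible alternative, sketched here for Python 3, is a translation table built with `str.maketrans` and applied with `str.translate`. The table name `REPLACEMENTS`, the function name `to_ascii_punctuation`, and the choice of characters are assumptions for illustration, and the mapping is deliberately incomplete.

```python
# Hypothetical, incomplete mapping from common typographic characters
# to similar-looking ASCII text. Extend it as needed.
REPLACEMENTS = str.maketrans({
    '\u2018': "'",    # left single quotation mark
    '\u2019': "'",    # right single quotation mark
    '\u201c': '"',    # left double quotation mark
    '\u201d': '"',    # right double quotation mark
    '\u2013': '-',    # en dash
    '\u2014': '-',    # em dash
    '\u2026': '...',  # horizontal ellipsis (one character becomes three)
})

def to_ascii_punctuation(text):
    """Replace the characters listed in REPLACEMENTS; leave everything else alone."""
    return text.translate(REPLACEMENTS)

str1 = '\u201cI am the greatest\u201d, said Gavin O\u2019Connor'
print(to_ascii_punctuation(str1))
```

Characters that are not in the table are left untouched, so the result is only guaranteed to be ASCII if every non-ASCII character in the input has an entry.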

Ritwik
  • 2
    There are many other commonly used Unicode characters with similar-looking ASCII versions, like the various dashes and hyphens. It's too hard to do all that manually. – sudo May 09 '18 at 15:44