Approximately converting unicode string to ascii string in python

Question

I don't know whether this is trivial or not, but I need to convert a Unicode string to an ASCII string, and I'd rather not have all those escape characters around. I mean, is it possible to have an "approximate" conversion to some quite similar ASCII character?

For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be just converted to Gavin O'Connor. Is this possible? Did anyone write some util to do it, or do I have to manually replace all chars?

Thank you very much! Marco

see this [http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database] — Facundo Casco, Nov 10 '11 at 22:52
What you are trying to achieve is not something desirable. You may end up having to add new replacements all the time. It would be really nice if you could explain why this is needed and why you must use ASCII instead of Unicode. — sorin, Nov 10 '11 at 22:53
@sorin: Not if you use a utility that already has replacements for all Unicode characters. — Petr Viktorin, Nov 11 '11 at 08:59

Petr Viktorin · Answer 1 · 2011-11-10T23:19:27.890

38

Use the Unidecode package to transliterate the string.

>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"

edited Nov 10 '11 at 23:19

answered Nov 10 '11 at 22:49

Petr Viktorin


1

Just installed it, but `unidecode.unidecode(u'Gavin O’Connor')` returns `"Gavin OConnor"`. – Marco Moschettini Nov 10 '11 at 23:30
1

It means that `’` is a Unicode character, and there is no ASCII equivalent. `’` is not `'`, at least according to Python. You may want to make a dictionary of special characters like these and store a similar looking ASCII character. Then you can just replace the Unicode characters with the corresponding ASCII ones. – D K Nov 11 '11 at 01:43
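The dictionary idea in the comment above could look roughly like this in Python 3. This is only a sketch: the table name `PUNCTUATION_TO_ASCII` and the helper `replace_lookalikes` are made up for illustration, and the four entries are examples, not a complete list.

```python
# Hypothetical replacement table; the entries are illustrative, not exhaustive.
PUNCTUATION_TO_ASCII = {
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
}

# str.maketrans converts the single-character keys into the
# code-point table that str.translate expects.
_TABLE = str.maketrans(PUNCTUATION_TO_ASCII)


def replace_lookalikes(text):
    """Replace the mapped characters and leave everything else untouched."""
    return text.translate(_TABLE)


print(replace_lookalikes("Gavin O\u2019Connor"))  # Gavin O'Connor
```

Characters that are not in the table pass through unchanged, so the result is only guaranteed to be ASCII if every non-ASCII character in the input has an entry.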

score 11 · Answer 2 · answered Nov 10 '11 at 22:48

11

import unicodedata

unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')

Output:

Gavin O'Connor

Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/


Acorn


1

That's just going to remove the apostrophe from the example input string. OP was looking for a way to replace it with the "close enough" ascii single quote character. – slowdog Nov 10 '11 at 22:53
Hmm, on my machine it gives the above output, but when attempting the same thing elsewhere the apostrophe is just removed.. odd. – Acorn Nov 10 '11 at 23:03
1

With my python 2.6.6, `unicodedata.normalize('NFKD', u'Gavin O\u2019Connor') == u'Gavin O\u2019Connor'`, and `u'Gavin O\u2019Connor'.encode('ascii', 'ignore') == 'Gavin OConnor'`. I am beyond baffled by the standard you linked to, so I can't tell if that's a bug of `unicodedata.normalize`, or correct behaviour. – slowdog Nov 10 '11 at 23:15
In 2.6.5 `unicodedata.normalize('NFKD', u"Gavin O’Connor").encode('ascii','ignore')` gives me `"Gavin O'Connor"` – Acorn Nov 10 '11 at 23:46
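For anyone who wants to check the behaviour discussed in these comments on a current interpreter, here is a small Python 3 sketch. The helper name `nfkd_ascii` is made up for illustration. It relies on the fact that U+2019 has no decomposition mapping in the Unicode character database, so NFKD leaves it alone and the `'ignore'` error handler then drops it.

```python
import unicodedata


def nfkd_ascii(text):
    """Normalize to NFKD, then drop everything that is not ASCII."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


name = "Gavin O\u2019Connor"

# NFKD does not change the curly apostrophe (U+2019).
print(unicodedata.normalize("NFKD", name) == name)  # True

# The 'ignore' handler therefore removes it instead of replacing it.
print(nfkd_ascii(name))  # Gavin OConnor

# Accented letters do decompose, so the base letter survives.
print(nfkd_ascii("caf\u00e9"))  # cafe
```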

score 8 · Answer 3 · answered Nov 10 '11 at 22:50

8

b = str(a.encode('utf-8').decode('ascii', 'ignore'))

should work fine.


D K


It doesn't work. It just removes all the non-ASCII characters when I try it. – sudo May 09 '18 at 15:43

score 2 · Answer 4 · answered Nov 10 '11 at 22:47

2

There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm
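A minimal Python 3 sketch of that two-step idea follows. It is not the code from the linked article: the names `DIRECT`, `strip_accents` and `approximate_ascii` are hypothetical, and the replacement table holds only a few sample entries.

```python
import unicodedata

# Hypothetical direct replacements for characters that do not decompose.
DIRECT = str.maketrans({
    "\u2019": "'",  # right single quotation mark
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
})


def strip_accents(text):
    """Decompose characters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def approximate_ascii(text):
    """Apply direct replacements, strip accents, mark the rest with '?'."""
    text = strip_accents(text.translate(DIRECT))
    return text.encode("ascii", "replace").decode("ascii")


print(approximate_ascii("Zo\u00eb O\u2019Connor \u2013 caf\u00e9"))  # Zoe O'Connor - cafe
```

Using the `'replace'` error handler instead of `'ignore'` leaves a visible `?` wherever a character was neither decomposed nor listed in the table, which makes gaps in the table easier to spot.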


Mark Tolonen


score -2 · Answer 5 · answered Jan 01 '18 at 12:33

-2

Try simple character replacement

str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))

PS: add # -*- coding: utf-8 -*- to the top of your .py file if you get an error


Ritwik


2

There are many other commonly used Unicode characters with similar-looking ASCII versions, like the various dashes and hyphens. It's too hard to do all that manually. – sudo May 09 '18 at 15:44
