remove special characters from string

Question

I have a string "Mikael Håfström" which contains some special characters. How do I remove them using Python?

Is your string a unicode string? Do you want to remove the characters or rather replace by "standard" characters? — Sven Marnach, Mar 10 '11 at 10:54
Related: [What is the best way to remove accents in a python unicode string?](http://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) — Sven Marnach, Mar 10 '11 at 10:59

Filip Dupanović · Answer 1 · 2011-03-10T21:19:54.840

13

You can use the unicodedata module to normalize unicode strings and encode them in their ASCII form like so:

>>> import unicodedata
>>> source = u'Mikael Håfström'
>>> unicodedata.normalize('NFKD', source).encode('ascii', 'ignore')
'Mikael Hafstrom'
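The session above is Python 2. As a minimal sketch of the same idea on Python 3 (where `encode` returns `bytes`, so a final `decode` is needed to get a `str` back), something like the following should work. The function name `strip_accents_to_ascii` is made up for illustration and is not part of any library.

```python
import unicodedata


def strip_accents_to_ascii(text: str) -> str:
    """Decompose characters with NFKD, then drop whatever is not ASCII."""
    # NFKD splits e.g. 'å' into 'a' followed by a combining ring above.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" silently discards the combining marks
    # (and any other non-ASCII character that did not decompose).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents_to_ascii("Mikael Håfström"))  # Mikael Hafstrom
```

Note that characters without a decomposition are dropped entirely rather than replaced, which is the behaviour discussed in the rest of this answer.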

One notable exception is that the letters 'đ' and 'Đ' are not recognized by Python and they do not get encoded to 'd', so they will simply be omitted from the result. That's a voiced alveolo-palatal affricate present in the Latin alphabet of some SEE languages, so it may or may not immediately concern you, depending on your audience and on whether you're providing full support for the Latin-1 character set. I currently have Python 2.6.5 (Mar 19 2010) running locally and the issue is present, though it may have been resolved in newer releases.

edited Mar 10 '11 at 21:19

answered Mar 10 '11 at 11:20

Filip Dupanović

I can't imagine why you call those two IPA letters a "notable" exception. BTW: s/alveolar-palatal/dental/. There are several letters and ligatures in Latin-1 which don't normalise to ASCII. Maybe you are confusing those two with Eth, used in Icelandic, Faroese, and Elfdalian. In any case the NFKD gimmick needs augmentation with a built-in list of exceptions. See my answer. – John Machin Mar 10 '11 at 11:58
For South Slavic languages, which I use, I did notice that there was no support for 'Dj', 'dj', 'Đ' or 'đ', hence the notable exception ;)! Truth be told, I don't have any experience with northern Germanic languages, and this is a rather specialized topic, so I'm a bit grizzled that there are 9 exceptions listed on effbot. Has any of this been corrected to date? – Filip Dupanović Mar 10 '11 at 12:15
Dupanovic: "issue"? "support"? "corrected"? You appear to misunderstand the purpose of the NFKD normalisation -- it's to decompose Unicode characters, if possible, not for the purpose of smashing them into ASCII. For some it's not possible; they don't decompose. No correction is required. All of the `unicodedata` functions get their data directly from tables provided by unicode.org. There is no "issue". – John Machin Mar 10 '11 at 19:05
I've finally noticed my slip where I said normalized, instead of encoded, that seems to have brought much confusion. Thanks John, your affable didactics were much appreciated! – Filip Dupanović Mar 10 '11 at 21:18
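
The point made in these comments, that some characters simply have no decomposition, can be checked with `unicodedata.decomposition`, which returns an empty string in that case. This is only an illustrative Python 3 sketch; the helper name `has_decomposition` is invented here.

```python
import unicodedata


def has_decomposition(ch: str) -> bool:
    """Return True if the Unicode database lists a decomposition for ch."""
    # decomposition() returns '' when the character does not decompose.
    return unicodedata.decomposition(ch) != ""


for ch in ("å", "ö", "đ", "Đ", "ð"):
    mapping = unicodedata.decomposition(ch) or "(none)"
    print(repr(ch), "decomposition:", mapping)
```

Characters for which `has_decomposition` is false pass through NFKD unchanged, so the later `encode('ascii', 'ignore')` step drops them instead of mapping them to a base letter.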

filmor · Answer 2 · 2011-06-05T21:26:53.847

5

For example, using the encode method: u"Mikael Håfström".encode("ascii", "ignore")

edited Jun 05 '11 at 21:26

answered Mar 10 '11 at 11:17

filmor

Your method just throws an exception, and it returns 'Mikael Hfstrm' if you add unicode as the input encoding. – toutpt May 30 '11 at 08:44
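
The behaviour described in this comment can be reproduced in Python 3 with a short sketch. It assumes the input is already a `str`; the helper name `drop_non_ascii` is hypothetical. Without a prior normalization step, the accented letters are removed outright instead of being reduced to their base letters.

```python
def drop_non_ascii(text: str) -> str:
    """Encode to ASCII, discarding every character that does not fit."""
    return text.encode("ascii", "ignore").decode("ascii")


name = "Mikael Håfström"
print(drop_non_ascii(name))  # Mikael Hfstrm

# With the default error handler ("strict") the same call raises instead.
try:
    name.encode("ascii")
except UnicodeEncodeError as exc:
    print("strict encoding failed:", exc.reason)
```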

John Machin · Answer 3 · 2011-03-10T19:07:55.807

1

See this effbot article (includes code). It makes reasonable transliterations into ASCII characters where possible. It is possible to extend the built-in conversion table to handle many other characters (e.g. those used in Eastern European languages) that don't have a canonical decomposition.
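
As a rough illustration of the idea of extending a conversion table (this is not the effbot code, and the table below is a small, hand-picked assumption rather than a complete list), a Python 3 sketch might apply an explicit mapping for characters that do not decompose before falling back to NFKD. The names `EXTRA_MAP` and `to_ascii` are invented for this example.

```python
import unicodedata

# Hypothetical, deliberately incomplete table of characters that have
# no Unicode decomposition; extend it for the languages you care about.
EXTRA_MAP = str.maketrans({
    "đ": "d", "Đ": "D",
    "ð": "d", "Ð": "D",
    "ø": "o", "Ø": "O",
    "ł": "l", "Ł": "L",
    "ß": "ss",
    "æ": "ae", "Æ": "AE",
})


def to_ascii(text: str) -> str:
    """Apply the explicit table first, then strip accents via NFKD."""
    text = text.translate(EXTRA_MAP)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Mikael Håfström"))  # Mikael Hafstrom
print(to_ascii("Đorđe"))            # Dorde
```

Whether a given replacement (for example 'ß' to 'ss') is acceptable depends on the language and the use case, so the table should be treated as a starting point only.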

edited Mar 10 '11 at 19:07

answered Mar 10 '11 at 11:51

John Machin
