I have many Unicode strings, such as carbon copolymers—III\n12- Géotechnique\n, containing a variety of non-ASCII characters, stored in a string variable named txtWords.

My goal is to remove all non-ASCII characters while preserving the consistency of the strings. For instance, I want the first line to become carbon copolymers III or carbon copolymers iii (case does not matter here) and the second one to become geotechnique\n, and so on.

Currently I am using the following code, but it does not produce what I expect. It turns carbon copolymers—III into carbon copolymersiii, which is definitely not what it should be:

    import unicodedata, re
    txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')
    txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)

If I use the regex code first then I get something worse (in terms of what I expect):

    import unicodedata, re
    txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)
    txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')

This way, for the string Géotechnique\n I get otechnique!
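Both behaviours can be reproduced with a short Python 3 sketch. The variable names (`text`, `folded`, `order_a`, `order_b`) are illustrative only, and the pattern is written as `[^a-z\n]` because the second `^` inside a character class is just a literal caret. The em dash and the é are written as escapes so the snippet does not depend on the file encoding.

```python
import re
import unicodedata

# \u2014 is the em dash, \u00e9 is the accented e
text = 'carbon copolymers\u2014III\n12- G\u00e9otechnique\n'

# Order A: fold to ASCII first, then apply the ASCII-only regex.
# encode('ascii', 'ignore') silently drops the em dash, so the regex
# never gets the chance to turn it into a space.
folded = unicodedata.normalize('NFKD', text.lower()).encode('ascii', 'ignore').decode('ascii')
order_a = re.sub(r'[^a-z\n]', ' ', folded)

# Order B: ASCII-only regex first, on the original (not lowercased) text.
# 'G' is uppercase and the accented e is non-ASCII, so both fall outside
# a-z and are replaced by spaces before normalization ever sees them.
order_b = re.sub(r'[^a-z\n]', ' ', text)

print(repr(order_a))
print(repr(order_b))
```

In order A the two words are glued together because the dash is deleted rather than replaced; in order B the accented letter is lost because the character class only knows about plain a-z.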

How can I resolve this issue?

Mariano
Amir
  • See Unidecode: https://pypi.python.org/pypi/Unidecode/ – Mark Ransom Nov 30 '15 at 01:20
  • This depends on why you need that reduction to ASCII. For your regex to be useful, you have to first apply the regex and then encode to ASCII – roeland Nov 30 '15 at 01:23
  • @roeland Well if I apply regex first then I get something even worse: **Géotechnique\n** will become **otechnique** – Amir Nov 30 '15 at 01:28
  • @roeland And well, I need that reduction since I'm matching those words against an ASCII-based word database that I have. – Amir Nov 30 '15 at 01:34
  • @Amir That's not what I meant. I was talking about the `.encode('ascii','ignore')` call. This call strips away non-ASCII characters like the em dash (—), so apply that call last. I'm guessing you got that code fragment [here](http://www.peterbe.com/plog/unicode-to-ascii). I think you need to pause for a second and understand what each of these calls does. – roeland Nov 30 '15 at 01:39
  • You could apply the regex first by targeting just the characters you're interested in. The positive class is `[\x00-\x09\x0b\x0c\x0e-@\[-\`{-\x7f]`; the negative class is `[^a-zA-Z\x{80}-\x{10ffff}\r\n]`. Both do the same thing. –  Nov 30 '15 at 01:47
  • @Amir You shouldn't change the title to resolved. Accepting an answer should suffice. More info in the meta post http://meta.stackoverflow.com/q/285390/5290909 – Mariano Dec 02 '15 at 13:41

1 Answer

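Use the \w character class, with the re.U flag, to replace non-alphanumeric characters with spaces before applying the decomposition trick. Unlike a-z, \w also matches accented letters such as é, so they survive until the normalization step: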

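    # coding: utf8
    from __future__ import unicode_literals, print_function
    import re
    import unicodedata as ud

    txtWords = 'carbon copolymers—III\n12- Géotechnique\n'
    # Replace anything that is not a word character or newline with a space.
    # re.U makes \w match non-ASCII letters such as é in Python 2 as well.
    txtWords = re.sub(r'[^\w\n]', r' ', txtWords.lower(), flags=re.U)
    # Decompose accented letters (é -> e + combining accent), then drop non-ASCII.
    txtWords = ud.normalize('NFKD', txtWords).encode('ascii', 'ignore').decode()
    print(txtWords)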

Output (Python 2 and 3):

carbon copolymers iii
12  geotechnique
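For reuse, the same three steps can be wrapped in a function. This is only a sketch assuming Python 3 (where `\w` is already Unicode-aware for `str` patterns); the name `to_ascii_words` is hypothetical and not part of any library, and the input uses escapes for the em dash and é so it does not depend on the source encoding.

```python
import re
import unicodedata


def to_ascii_words(text):
    """Hypothetical helper: lowercase, blank out punctuation, fold accents to ASCII."""
    # 1. Replace everything that is not a word character or a newline with a
    #    space. The accented e counts as a word character, so it is kept,
    #    while the em dash and the hyphen become spaces.
    spaced = re.sub(r'[^\w\n]', ' ', text.lower())
    # 2. NFKD splits an accented letter into its base letter followed by a
    #    combining mark (for example e + U+0301).
    decomposed = unicodedata.normalize('NFKD', spaced)
    # 3. Dropping non-ASCII code points removes the combining mark and
    #    leaves the base letter behind.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


result = to_ascii_words('carbon copolymers\u2014III\n12- G\u00e9otechnique\n')
print(repr(result))
```

Two caveats: `\w` also keeps digits and underscores, so `12` stays in the output, and letters with no Unicode decomposition (such as ø or ß) are dropped entirely by the final encode step rather than transliterated.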
Mark Tolonen
  • It doesn't work. Are you sure this is the correct code? After I execute txtWords = re.sub(r'[^\w\n]',r' ',txtWords.lower()) I get **'carbon copolymers iii\n12 g otechnique\n'** – Amir Dec 01 '15 at 02:12
  • @Amir, you didn't specify a Python version. I was using Python 3. Updated to work in both. `txtWords` needs to be a Unicode string to start with, and for `\w` to work correctly in Python 2, `flags=re.U` is required for the `re.sub`. – Mark Tolonen Dec 01 '15 at 02:23
  • @Amir, the code was correct so I rolled back your edit. The `from __future__ import unicode_literals` handles that and is portable to Python 3. – Mark Tolonen Dec 01 '15 at 15:47
  • Well if you don't convert the string to unicode in the first place, then the normalization code does not work (at least for me) – Amir Dec 02 '15 at 00:53
  • @Amir, it *is* Unicode if you have `from __future__ import unicode_literals`. You can also just make it a Unicode string without it by using `u'carbon...'`. You may need it in your code if you don't have that, but it isn't needed in mine. In fact, your edit gives an error in Python 2 of `decoding Unicode not supported`. – Mark Tolonen Dec 02 '15 at 02:49
  • Can you please take a look [here](http://stackoverflow.com/questions/34034225/remove-unicode-symbols-while-preserving-string-consistency)? – Amir Dec 02 '15 at 03:11