Maintaining the consistency of strings before and after converting to ASCII

Question

I have many Unicode strings, such as carbon copolymers—III\n12- Géotechnique\n, containing a variety of non-ASCII characters, stored in a string variable named txtWords.

My goal is to remove all non-ASCII characters while preserving the consistency of the strings. For instance, I want the first line to turn into carbon copolymers III or carbon copolymers iii (case does not matter here), the second one into geotechnique\n, and so on.

Currently I am using the following code, but it does not give me what I expect. It changes carbon copolymers—III to carbon copolymersiii, which is definitely not what it should be:

    import unicodedata, re
    txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')
    txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)

If I use the regex code first then I get something worse (in terms of what I expect):

    import unicodedata, re
    txtWords = re.sub(r'[^a-z^\n]',r' ',txtWords)
    txtWords = unicodedata.normalize('NFKD', txtWords.lower()).encode('ascii','ignore')

This way, for the string Géotechnique\n I get otechnique!

How can I resolve this issue?

This depends on why you need that reduction to ASCII. For your regex to be useful, you have to first apply the regex and then encode to ASCII — roeland, Nov 30 '15 at 01:23
@roeland Well if I apply regex first then I get something even worse: **Géotechnique\n** will become **otechnique** — Amir, Nov 30 '15 at 01:28
@roeland And I need that reduction because I'm matching those words against an ASCII-based word database that I have. — Amir, Nov 30 '15 at 01:34
@Amir That's not what I meant. I was talking about the `.encode('ascii','ignore')` call. This call strips away non-ASCII characters like the em-dash (—). So apply that call last. I'm guessing you got that code fragment [here](http://www.peterbe.com/plog/unicode-to-ascii). So I think you need to pause for a second and understand what each of these calls does. — roeland, Nov 30 '15 at 01:39
You could apply the regex first by targeting just the characters you're interested in. Positive class is `[\x00-\x09\x0b\x0c\x0e-@\[-\`{-\x7f]` negative class is `[^a-zA-Z\x{80}-\x{10ffff}\r\n]` both do the same thing. — , Nov 30 '15 at 01:47
@Amir You shouldn't change the title to resolved. Accepting an answer should suffice. More info in the meta post http://meta.stackoverflow.com/q/285390/5290909 — Mariano, Dec 02 '15 at 13:41

Mark Tolonen · Accepted Answer · 2015-12-01T15:43:19.927

Use the \w character class (with the Unicode flag) to replace non-alphanumeric characters with spaces before applying the decomposition trick:

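# coding: utf8
from __future__ import unicode_literals, print_function
import unicodedata as ud
import re

txtWords = 'carbon copolymers—III\n12- Géotechnique\n'
# Replace anything that is not a word character or a newline with a space.
# flags=re.U makes \w match accented letters in Python 2 as well.
txtWords = re.sub(r'[^\w\n]', r' ', txtWords.lower(), flags=re.U)
# Decompose accented letters, then drop the non-ASCII combining marks.
txtWords = ud.normalize('NFKD', txtWords).encode('ascii', 'ignore').decode()
print(txtWords)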

Output (Python 2 and 3):

carbon copolymers iii
12  geotechnique
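
To see why the order of the two steps matters, here is a minimal Python 3 sketch that runs the same calls in both orders. The helper name `to_ascii_words` is hypothetical and only wraps the calls from the code above; it is an illustration, not a definitive implementation.

```python
import re
import unicodedata


def to_ascii_words(text):
    """Hypothetical helper: lowercase, blank out punctuation, then drop accents."""
    # Step 1: replace everything that is not a word character or a newline
    # with a space. In Python 3, \w is Unicode-aware for str patterns, so
    # accented letters survive and the em-dash becomes a space.
    spaced = re.sub(r'[^\w\n]', ' ', text.lower())
    # Step 2: NFKD splits an accented letter into a base letter plus a
    # combining mark (for example 'e' followed by U+0301).
    decomposed = unicodedata.normalize('NFKD', spaced)
    # Step 3: encoding with 'ignore' drops the combining marks and any
    # other remaining non-ASCII code points.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# Same sample as in the question, written with escapes for portability.
sample = 'carbon copolymers\u2014III\n12- G\u00e9otechnique\n'
result = to_ascii_words(sample)
print(repr(result))

# Reversed order: encode() discards the em-dash before any regex could
# turn it into a space, so the two words are glued together.
reversed_order = (
    unicodedata.normalize('NFKD', sample.lower())
    .encode('ascii', 'ignore')
    .decode('ascii')
)
print(repr(reversed_order))
```

The first value keeps the word boundary (`carbon copolymers iii`), while the second reproduces the `carbon copolymersiii` result described in the question.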

edited Dec 01 '15 at 15:43

answered Nov 30 '15 at 19:03

Mark Tolonen

It doesn't work. Are you sure this is the correct code? After I execute txtWords = re.sub(r'[^\w\n]',r' ',txtWords.lower()) I get **'carbon copolymers iii\n12 g otechnique\n'** – Amir Dec 01 '15 at 02:12
@Amir, you didn't specify a Python version. I was using Python 3. Updated to work in both. `txtWords` needs to be a Unicode string to start with, and for `\w` to work correctly in Python 2, `flags=re.U` is required for the `re.sub`. – Mark Tolonen Dec 01 '15 at 02:23
@Amir, the code was correct so I rolled back your edit. The `from __future__ import unicode_literals` handles that and is portable to Python 3. – Mark Tolonen Dec 01 '15 at 15:47
Well if you don't convert the string to unicode in the first place, then the normalization code does not work (at least for me) – Amir Dec 02 '15 at 00:53
@Amir, it *is* Unicode if you have `from __future__ import unicode_literals`. You can also just make it a Unicode string without it by using `u'carbon...'`. You may need it in your code if you don't have that, but it isn't needed in mine. In fact, your edit gives an error in Python 2 of `decoding Unicode not supported`. – Mark Tolonen Dec 02 '15 at 02:49
Can you please take a look at [here](http://stackoverflow.com/questions/34034225/remove-unicode-symbols-while-preserving-string-consistency)? – Amir Dec 02 '15 at 03:11
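
The `g otechnique` result from the comments can be reproduced in Python 3 by forcing ASCII-only matching with `re.ASCII`, which mirrors how `\w` behaves in Python 2 when `flags=re.U` is left out. This is only an illustrative sketch, and the variable names are made up for the example.

```python
import re

word = 'g\u00e9otechnique'  # 'géotechnique'

# Default in Python 3: \w is Unicode-aware for str patterns, so the
# accented letter is kept and nothing is replaced.
unicode_aware = re.sub(r'[^\w\n]', ' ', word)

# With re.ASCII, \w only matches [a-zA-Z0-9_], so the accented letter is
# treated as a non-word character and replaced by a space.
ascii_only = re.sub(r'[^\w\n]', ' ', word, flags=re.ASCII)

print(ascii(unicode_aware))
print(ascii(ascii_only))
```

Once the accented letter has been replaced by a space, the later NFKD step has nothing left to decompose, which is why the regex must be Unicode-aware when it runs first.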
