How to detect ASCII characters on a string in python

Question

I'm working on a tool in Maya where, at some point, the user can enter a comment in a textField. This comment is later used as part of the name of the file that gets saved. I work in France, so the user might type accented characters such as "é" or "à".
What I would love is to translate them to their unaccented equivalents. However, I realise this is quite tricky, so I would be OK with just detecting them so I can show a warning message to the user. I don't want to simply strip the offending letters, as that might make the comment incomprehensible.
I know there are some similar questions around here, but they're all about other languages I don't know or understand (such as C++ or PHP).

Here's what I found so far around the web :

import re
comment = 'something written with some french words and numbers'
if re.match(r'^[A-Za-z0-9_]+$', comment):
    pass  # issue a warning for the user

This first solution doesn't work because it considers accented characters as acceptable.

I found this :

ENGLISH_CHARS = re.compile(r'[^\W_]', re.IGNORECASE)
ALL_CHARS = re.compile(r'[^\W_]', re.IGNORECASE | re.UNICODE)

assert len(ENGLISH_CHARS.findall('_àÖÎ_')) == 0
assert len(ALL_CHARS.findall('_àÖÎ_')) == 3

which I thought about using like this :

ENGLISH_CHARS = re.compile(r'[^\W_]', re.IGNORECASE)
if len(ENGLISH_CHARS.findall(comment)) != len(comment):
    pass  # issue a warning for the user

but it only seems to work if the string is encapsulated within underscores.

I'm really sorry if this is a duplicate of something I haven't found or understood, but it's been driving me nuts.

It is naïve and risqué to call them "English" characters. You really mean *ASCII* characters. Using that as a search term may also yield better answers... – deceze, Mar 17 '16 at 14:15
@deceze Okay, I'll change the title of my question then. The reason is that the first answer I found (shown above) claimed to filter ASCII characters but still returns None, so I'm actually not sure what ASCII really is... – tic-tac-orange, Mar 17 '16 at 14:24

Schore · Answer 1 · 2016-03-17T16:38:39.837

0

The unicode built-in (Python 2) tries to decode your byte string using the given encoding. It defaults to ASCII and raises an exception if decoding fails.

try:
    unicode(filename)
except UnicodeDecodeError:
    show_warning()

This accepts only ASCII input, so accented characters are rejected, which may be what you want.

If you already have a Unicode string, encode it to ASCII instead. This raises a UnicodeEncodeError if the string contains any non-ASCII character.

filename.encode("ASCII")

Example:

>>> unicode("ää")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
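
The two cases above can be wrapped in one small helper. This is only a sketch: the name is_ascii is made up for illustration, and it assumes the comment is a text string (a unicode object in Python 2, as Maya returns, or a str in Python 3). Catching UnicodeError covers both UnicodeEncodeError and UnicodeDecodeError, since it is their common base class.

```python
# -*- coding: utf-8 -*-

def is_ascii(text):
    """Return True if `text` contains only ASCII characters.

    `is_ascii` is a hypothetical helper name, not part of any library.
    """
    try:
        # Encoding to ASCII fails as soon as a non-ASCII character is met.
        text.encode("ascii")
    except UnicodeError:
        # Base class of UnicodeEncodeError and UnicodeDecodeError.
        return False
    return True


print(is_ascii(u"scene_v01"))
print(is_ascii(u"sc\u00e8ne_v01"))  # u"\u00e8" is "è"
```

The first call should print True and the second False, so the warning can be shown whenever is_ascii(comment) is false.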

edited Mar 17 '16 at 16:38

answered Mar 17 '16 at 14:19


Thanks. In the documentation you linked I found this u.encode('ascii', 'ignore') which might also be quite useful. – tic-tac-orange Mar 17 '16 at 14:36
Plus it also accepts numbers which is what I was looking for. I need to get rid of the underscores however but I think I can achieve that using re – tic-tac-orange Mar 17 '16 at 14:38
Okay so the strings appear to be encoded as unicode already in maya python. So when you access them and store them in a variable, there's no error when trying to convert them. – tic-tac-orange Mar 17 '16 at 16:14
See my updated answer – Schore Mar 17 '16 at 16:22
works like a charm now (the error is UnicodeEncodeError though) – tic-tac-orange Mar 17 '16 at 16:35
Oh you are right! I'll fix that :) – Schore Mar 17 '16 at 16:37

JustMe · Accepted Answer · 2016-03-20T13:35:46.193

It seems you actually have two questions.

How do you find out whether the text contains accented characters that need converting to their ASCII look-alikes?

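# coding: utf-8
import string

text = u"Montréal, über, 12.89, Mère, Françoise, noël, 889"
# string.printable holds the printable ASCII characters
allowed_letters = string.printable
# collect every character that is not printable ASCII
name_has_accented = [letter for letter in text if letter not in allowed_letters]
if name_has_accented:
    # convert() is defined further down
    text = "".join(convert(text))
print(text)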
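
If the goal is only to warn the user, it may help to name the characters that caused the problem. Here is a rough sketch of that idea: find_non_ascii and warning_message are hypothetical names, and the check uses ord(char) > 127 (strictly non-ASCII) rather than string.printable.

```python
# -*- coding: utf-8 -*-

def find_non_ascii(text):
    """Return the distinct non-ASCII characters of `text`, in order of first appearance.

    `find_non_ascii` is a hypothetical helper name used for illustration.
    """
    found = []
    for char in text:
        # ASCII covers code points 0 to 127 inclusive.
        if ord(char) > 127 and char not in found:
            found.append(char)
    return found


comment = u"décor à côté"
offending = find_non_ascii(comment)
warning_message = None
if offending:
    # Build the text of the warning; how it is displayed is up to the tool.
    warning_message = u"Please avoid these characters: " + u", ".join(offending)
```

An empty list means the comment is safe to use in the filename as it stands.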

How do you convert them easily to unaccented characters? You could devise a nice generic solution, or you might do it for French only, quite easily like this:

def convert(text):
    # map each accented character to its unaccented replacement
    replacements = {
        u"à": "a",
        u"Ö": "O",
        u"é": "e",
        u"ü": "u",
        u"ç": "c",
        u"ë": "e",
        u"è": "e",
    }
    # characters without an entry in the table are returned unchanged
    return [replacements.get(letter, letter) for letter in text]
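
For the generic route, one common approach (not the method shown above, just a possible alternative) is to decompose the text with the standard unicodedata module and drop the combining marks. The name strip_accents is made up for this sketch, and it assumes the input is a Unicode string.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Return `text` with combining accent marks removed.

    `strip_accents` is a hypothetical helper name used for illustration.
    """
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns a non-zero class for combining marks.
    return u"".join(c for c in decomposed if not unicodedata.combining(c))


result = strip_accents(u"Montréal, über, Mère, Françoise, noël")
# result == u"Montreal, uber, Mere, Francoise, noel"
```

Letters that have no decomposition, such as "ø" or "œ", pass through unchanged, so an ASCII check is still worth running on the result before building the filename.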

Apparently the letters don't need to be converted (says my boss, who prefers that the user fix the input himself). But thanks for your answer, as this drove me crazy for the entire day. – tic-tac-orange, Mar 17 '16 at 15:02
I'm getting an error when I try to use your solution, saying 'bool' object is not iterable. It comes from the line where you try to find accented characters, and as I'm not familiar with that syntax I'm not quite sure how to fix it. – tic-tac-orange, Mar 17 '16 at 16:27
