I'm working on a tool in Maya where, at some point, the user can enter a comment in a textField. This comment is later used as part of the name of the file that gets saved. I work in France, so the user might type accented characters such as "é" or "à".
Ideally I would just translate them to the corresponding unaccented character. However, I realise this is quite tricky, so I would be OK with just detecting them so that I can show a warning message to the user. I don't want to simply strip the offending letters, as that might make the comment incomprehensible.
I know there are some similar questions around here, but they are all about other languages I don't know or understand (such as C++ or PHP).

Here's what I have found so far around the web:

import re
comment = 'something written with some french words and numbers'
if re.match(r'^[A-Za-z0-9_]+$', comment):
    # issue a warning for the user
    pass

This first solution doesn't work because it treats accented characters as acceptable.

I then found this:

# Python 2: without re.UNICODE, \w only matches ASCII letters, digits and "_".
ENGLISH_CHARS = re.compile(r'[^\W_]', re.IGNORECASE)
ALL_CHARS = re.compile(r'[^\W_]', re.IGNORECASE | re.UNICODE)

assert len(ENGLISH_CHARS.findall(u'_àÖÎ_')) == 0
assert len(ALL_CHARS.findall(u'_àÖÎ_')) == 3

which I thought about using like this:

ENGLISH_CHARS = re.compile(r'[^\W_]', re.IGNORECASE)
if len(ENGLISH_CHARS.findall(comment)) != len(comment):
    # issue a warning for the user
    pass

but it only seems to work if the string is encapsulated within underscores.

I'm really sorry if this is a duplicate of something I haven't found or understood, but it's been driving me nuts.

  • It is naïve and risqué to call them "English" characters. You really mean *ASCII* characters. Using that as a search term may also yield better answers... – deceze Mar 17 '16 at 14:15
  • @deceze Okay, I'll change the title of my question then. The reason is that the first answer I found (shown above) claimed to filter ASCII characters but still returns None, so I'm actually not sure what ASCII really is... – tic-tac-orange Mar 17 '16 at 14:24

2 Answers

The unicode built-in decodes your byte string using the given encoding. It defaults to ASCII and raises a UnicodeDecodeError if the string contains non-ASCII bytes.

try:
    unicode(filename)
except UnicodeDecodeError:
    show_warning()

This only accepts ASCII (unaccented) characters, which may be what you want.

If you already have a unicode string, encode it to ASCII instead; that raises a UnicodeEncodeError if the string contains a non-ASCII character.

filename.encode("ASCII")

Example:

>>> unicode("ää")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
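
If it helps, here is a minimal sketch that wraps the encode check in a helper and also collects the offending characters, so the warning can tell the user what to change. The names is_ascii and non_ascii_chars are my own, the print call stands in for whatever warning dialog the tool uses, and it assumes the comment is already a unicode string.

```python
# -*- coding: utf-8 -*-

def is_ascii(text):
    """Return True if the unicode string contains only ASCII characters."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def non_ascii_chars(text):
    """Return the non-ASCII characters of text, in order of appearance."""
    return [ch for ch in text if ord(ch) > 127]


# Escapes are used so the example does not depend on the file encoding.
comment = u"pr\u00e9par\u00e9 \u00e0 la main"  # "préparé à la main"
if not is_ascii(comment):
    offending = u", ".join(non_ascii_chars(comment))
    # Placeholder for the tool's real warning dialog (hypothetical).
    print(u"Please replace these characters: " + offending)
```

Note that this only checks for ASCII; a character that is ASCII but still awkward in a filename (a slash, for instance) would need a separate check.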
Schore

It seems you actually have two questions.

  1. How to detect whether the text contains accented characters that need converting to their ASCII look-alikes.

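    # coding: utf-8
    import string

    text = u"Montréal, über, 12.89, Mère, Françoise, noël, 889"
    # string.printable holds the ASCII letters, digits, punctuation and whitespace.
    allowed_letters = string.printable
    # Collect every character that is not printable ASCII.
    name_has_accented = [letter for letter in text if letter not in allowed_letters]
    if name_has_accented:
        # convert() is defined in step 2 below and returns a list of characters.
        text = "".join(convert(text))
    print(text)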
    
  2. How to convert them easily to unaccented characters? You could devise a nice generic solution, or you could handle French only, quite easily, like this:

    def convert(text):
        # Map each accented character to its closest ASCII letter.
        replacements = {
            u"à": "a",
            u"Ö": "O",
            u"é": "e",
            u"ü": "u",
            u"ç": "c",
            u"ë": "e",
            u"è": "e",
        }
        # Characters without a mapping are returned unchanged.
        return [replacements.get(letter, letter) for letter in text]
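
As for the generic route mentioned above, one common approach (not the table-based method shown here) is to decompose each character with the standard unicodedata module and drop the combining accent marks. This is only a sketch: strip_accents is a name I made up, and it assumes the input is a unicode string.

```python
# -*- coding: utf-8 -*-
import unicodedata


def strip_accents(text):
    """Return text with combining accent marks removed."""
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks.
    return u"".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Escapes are used so the example does not depend on the file encoding.
print(strip_accents(u"Montr\u00e9al, No\u00ebl, Fran\u00e7oise"))
# prints: Montreal, Noel, Francoise
```

Letters that have no decomposition are left untouched, so it is still worth running an ASCII check on the result before using it in a filename.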
    
JustMe
  • Apparently the letters don't need to be converted (my boss says he prefers that the user fix the input himself). But thanks for your answer, as this drove me crazy for the entire day. – tic-tac-orange Mar 17 '16 at 15:02
  • I'm getting an error when I try to use your solution, saying 'bool' object is not iterable. It comes from the line where you try to find the accented characters, and as I'm not familiar with that syntax I'm not quite sure how to fix it. – tic-tac-orange Mar 17 '16 at 16:27