Replacing non-ASCII characters in a string

Asked Aug 12 '19 at 15:13

Active Aug 12 '19 at 15:26

Viewed 36 times

I'm trying to replace specific characters that occur in a string. I've looked through many related posts on similar issues, but the ones I've found just remove the characters entirely.

The full code is supposed to find the most common n-grams and word frequencies, but the non-ASCII characters are throwing it off.

This is the code I've written to handle it, but it's currently not working:

from unidecode import unidecode

words = ['vãhãn', 'chairã', 'â']

def ASCII_fix(words):
    """
    :param words: list of unfiltered words that have different encoding
    :return: formatted utf-8 list
    """
    for x in words:
        word = x
        for a in word:
            unidecode(a)
    return words

words = ASCII_fix(words)

Output: [('â', 'â', 'â'), ('ã', 'â', 'ã'), ('ã', 'ã', 'ã')]

If this is a dupe, just let me know. If there's a handy package that can help with this, that'd be great!

edited Aug 12 '19 at 15:26

Sebastian Goslin

[Unidecode](https://pypi.org/project/Unidecode/) – Ry- Aug 12 '19 at 15:15
@AndrejKesely: Not the same thing. – Ry- Aug 12 '19 at 15:16
@Ry- Trying out the module now – Sebastian Goslin Aug 12 '19 at 15:18
@Ry- I'm testing Unidecode now and looking at its docs, but it's still not working. I'm updating the code in the question so you can see. – Sebastian Goslin Aug 12 '19 at 15:25
Strings are immutable, so `unidecode(a)` returns a new string that you throw out. It also operates on an entire string, so `def ASCII_fix(words): return [unidecode(word) for word in words]`. – Ry- Aug 12 '19 at 16:13
@Ry- I see, testing now – Sebastian Goslin Aug 12 '19 at 16:20
@Ry- That worked! I ended up writing the same thing, just with a regular loop instead of a list comprehension. That's perfect, thank you! – Sebastian Goslin Aug 12 '19 at 16:24
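
For anyone landing here later, the fix from Ry-'s comment can be sketched roughly as below. `unidecode` returns a new string rather than changing its argument, so the result has to be collected into a new list. The names `ascii_fix` and `to_ascii` and the standard-library fallback are assumptions added for illustration and are not from the thread. The fallback only strips accents and silently drops characters that have no ASCII decomposition, so it is not a full substitute for Unidecode.

```python
import unicodedata

try:
    # Third-party package suggested in the comments: pip install Unidecode
    from unidecode import unidecode as to_ascii
except ImportError:
    # Stdlib fallback (an assumption, not from the thread): decompose accented
    # characters, then drop the combining marks and anything else non-ASCII.
    def to_ascii(text):
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def ascii_fix(words):
    """Return a new list with each word transliterated to ASCII."""
    # Strings are immutable, so keep the value that to_ascii returns.
    return [to_ascii(word) for word in words]


words = ["vãhãn", "chairã", "â"]
words = ascii_fix(words)
print(words)  # ['vahan', 'chaira', 'a']
```

The same thing works as a regular loop that appends `to_ascii(word)` to a new list; the important part is using the returned string instead of discarding it.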
