Can't decode accent properly in pandas dataframe

Question

I am trying to make a bar chart race for Dataphile (my YouTube channel) with "Judo Athletes with most Olympic medals". Here's my problem: some athletes have accents in their names in my dataset (csv) and I can't decode them properly.

For example, in my dataset at line 5, the athlete's name is "Andreas TÃƒÂ¶lzer".

Here is my code:

import pandas as pd

years = [str(y) for y in range(1972, 2020, 4)]
sex = ["mens", "womens"]
cat = ["extra-lightweight", "lightweight", "half-lightweight", "half-middleweight", "middleweight", "half-heavyweight", "heavyweight", "open-class"]

df_results = pd.DataFrame(columns=["Athlete"] + years)

all_df = {}

for s in sex: # gets all sexes
    for c in cat: #gets all weight categories
        for y in years: # gets all years with summer olympics
            try:
                all_df[y] = pd.read_csv(r"C:\Users\joris\Coding\judo_olympics\olympics_summer_" + y + "_JUD_" + s + "-" + c +"_final_standings.csv")
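                # first four rows of the final standings; copy so new columns can be added safely
                df_med = all_df[y].head(4)[["Athlete"]].copy()
                for w in years:
                    # 1 for every Games from year y onward (running total), 0 before
                    df_med[w] = 1 if int(w) >= int(y) else 0
                # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
                df_results = pd.concat([df_results, df_med])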
            except FileNotFoundError:
                pass    
df_results = df_results.groupby("Athlete").sum()

df_results.index = df_results.index.str.normalize('NFKD').str.encode('ascii', errors='ignore').str.decode('utf-8') # got that from the internet

Here, we can see that the athlete's name has not been decoded properly in the output.

What I would like is to simply change letters with accent to the same letter without accent (example: "é" would become "e").

There should be no letter from other alphabets in my datasets, only annoying accents.

Please let me know if you have a solution or if you need more info from my code.

Thanks!

You appear to be encoding Unicode from your CSV into ASCII in the last line. Accents don't work in ASCII. – Dave Kielpinski, Mar 25 '20 at 20:11
Check out this answer: [https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) — jammin0921, Mar 25 '20 at 20:12
Your input dataset is a mess. It is decoded wrong twice. `'TÃƒÂ¶lzer'.encode('cp1252').decode('utf8').encode('cp1252').decode('utf8') -> 'Tölzer'`. Read the original data as UTF-8 and to display it correctly in Excel encode the CSV file with `'utf-8-sig'`. — Mark Tolonen, Mar 26 '20 at 18:17
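
The round trip described in the comment above can be checked with the standard library alone. This is a minimal sketch, assuming the names really are UTF-8 text that was mis-read as cp1252 twice; `fix_double_mojibake` is a hypothetical helper name, not part of pandas or any other library.

```python
def fix_double_mojibake(text: str, rounds: int = 2) -> str:
    """Undo UTF-8 text that was wrongly decoded as cp1252, up to `rounds` times."""
    for _ in range(rounds):
        try:
            text = text.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not (or no longer) this kind of garbling: keep the last good value.
            break
    return text


# The garbled name from the question
print(fix_double_mojibake("Andreas TÃƒÂ¶lzer"))  # Andreas Tölzer
```

Names that are already clean come back unchanged, because the intermediate bytes are not valid UTF-8 and the loop stops early. If the source files themselves are valid UTF-8, passing `encoding="utf-8"` to `pd.read_csv` should avoid the garbling in the first place.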

score 0 · Answer 1 · answered Mar 25 '20 at 20:47

There is a Python package, Unidecode, that you can use for this.

pip install --user unidecode

Then, in Python:

>>> from unidecode import unidecode
>>> print(unidecode('Ölfäßchen'))
Olfasschen

answered Mar 25 '20 at 20:47

Johnny Wezel


But `unidecode` won't fix garbled text for you. Check: `unidecode.unidecode('TÃƒÂ¶lzer')` returns `'TAfAPlzer'`, not `'Tolzer'` for the original name "Tölzer". – lenz Mar 25 '20 at 22:08
Exactly, as noted by lenz, I still don't get the expected result. – Joris Limonier Mar 26 '20 at 14:21
Yes, what you've got is a string that was decoded the wrong way. Of course that's not covered. But there is even a Python package for that: ftfy. But first you should try to decode Unicode properly. – Johnny Wezel Mar 26 '20 at 15:39
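
Once the names are decoded correctly, the accents can also be removed without a third-party package. The sketch below uses the standard `unicodedata` module instead of Unidecode; `strip_accents` is a hypothetical helper name chosen for illustration.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Replace accented letters with their base letter (e.g. 'é' -> 'e')."""
    # NFKD splits 'ö' into 'o' plus a combining diaeresis ...
    decomposed = unicodedata.normalize("NFKD", text)
    # ... and the combining marks are then dropped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Andreas Tölzer"))  # Andreas Tolzer
```

On the DataFrame from the question this could be applied with `df_results.index = df_results.index.map(strip_accents)`. Unlike Unidecode, this approach leaves letters that have no decomposition, such as "ß" or "ø", unchanged.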
