Translate pandas DataFrame using dictionary sorted by word length

Question

I have imported an Excel file into a pandas DataFrame, which I'm trying to translate and then export back to Excel.

For example, say this is my data set:

import pandas as pd

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}
data = [['cool guy'], ['cool'], ['guy']]
df = pd.DataFrame(data, columns=['WORDS'])


print(df)
#    WORDS   
# 0  cool guy   
# 1  cool  
# 2  guy

The easiest solution would be to use pandas' built-in replace method. However, if you use:

df['WORDS'] = df['WORDS'].replace(d, regex=True)

The result is:

print(df)
#    WORDS   
# 0  chill dude   
# 1  chill  
# 2  dude

(cool guy doesn't get translated correctly)

This could be solved by sorting the dictionary by the longest word first. I tried to use this function:

import re
def replace_words(col, dictionary):
    # sort keys by length, in reverse order
    for item in sorted(dictionary.keys(), key = len, reverse = True):
        col = re.sub(item, dictionary[item], col)
    return col

But calling it on the column:

df['WORDS'] = replace_words(df['WORDS'], d)

results in a TypeError: expected string or bytes-like object

Trying to convert each row to a string did not help either:

...*
col = re.sub(item, dictionary[item], [str(row) for row in col])

Does anyone have any solution or different approach I could try?

Unless I'm misunderstanding don't you just want `replace` without regex? `df['WORDS'] = df['WORDS'].replace(d)` — Henry Ecker, Jul 17 '21 at 18:11
I think you need `df['WORDS'].replace(dict(sorted(d.items(), key=lambda k: len(k[0]), reverse=True)), regex=True)`. — Henry Yik, Jul 17 '21 at 18:31
@HenryEcker I seem to have misunderstood the need for regex. Simply replace(d) was enough, as you said! Thank you! — Tomas Storås, Jul 17 '21 at 19:34
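
For cases where cells contain the dictionary words inside longer text, so whole-cell replacement is not enough, the longest-first idea from the question can be sketched per string. Note that re.sub expects a single string, which is why passing the whole column raised the TypeError. This is only a sketch, not the approach the thread settled on: replace_words is rewritten here to take one string, and the cells list is made-up sample data.

```python
import re

def replace_words(text, dictionary):
    # Try longer keys first so "cool guy" wins over "cool" and "guy".
    keys = sorted(dictionary, key=len, reverse=True)
    # re.escape makes every key match as literal text, not as a regex.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Single pass over the string; each match is looked up in the dictionary.
    return pattern.sub(lambda m: dictionary[m.group(0)], text)

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}
cells = ["cool guy", "cool", "guy", "a cool guy here", "nice"]  # sample data
translated = [replace_words(c, d) for c in cells]
print(translated)  # ['bro', 'chill', 'dude', 'a bro here', 'nice']
```

On the question's DataFrame this could be applied cell by cell with df['WORDS'].apply(lambda s: replace_words(s, d)). Because all keys are matched in one pass, text that has already been translated is not translated a second time.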

score 1 · Accepted Answer · answered Jul 17 '21 at 18:29

Let us try replace

df.WORDS.replace(d)
Out[307]: 
0      bro
1    chill
2     dude
Name: WORDS, dtype: object

BENY


replace(d) was enough for it to work! Thank you! – Tomas Storås Jul 17 '21 at 19:37
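
A small sketch of why plain replace(d) is enough here: without regex, a dictionary key has to equal the entire cell, so "cool guy" is never split into "cool" and "guy". The extra "nice" row is an assumption added to show that unmatched cells are left unchanged.

```python
import pandas as pd

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}
# "nice" is an extra sample value that is not a key in the dictionary.
df = pd.DataFrame({"WORDS": ["cool guy", "cool", "guy", "nice"]})

# Without regex, a key must equal the whole cell to be replaced.
whole_cell = df["WORDS"].replace(d)
print(whole_cell.tolist())  # ['bro', 'chill', 'dude', 'nice']

# With regex=True the keys are treated as patterns that can match inside
# a cell, which is what produced "chill dude" in the question.
substring = df["WORDS"].replace(d, regex=True)
print(substring.tolist())
```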

ThePyGuy · Answer 2 · 2021-07-17T18:20:51.967

You can use Series.map and pass it the dictionary. Each value in the column that matches a key in the dictionary is replaced with the corresponding dictionary value; values with no matching key become NaN.

>>> df['WORDS'].map(d)
0      bro
1    chill
2     dude
Name: WORDS, dtype: object

edited Jul 17 '21 at 18:20

answered Jul 17 '21 at 18:10


This doesn't work because if there is a word in the df that is not in the dictionary, it gets replaced with NaN. – Tomas Storås Jul 17 '21 at 18:55
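
One possible way around the NaN problem, sketched under the assumption that unknown words should simply be kept as they are: fill the gaps left by map with the original values. The Series below, including the unmatched word "nice", is sample data.

```python
import pandas as pd

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}
# "nice" is a sample word that has no key in the dictionary.
words = pd.Series(["cool guy", "cool", "nice"], name="WORDS")

mapped = words.map(d)          # "nice" has no key, so it becomes NaN
result = mapped.fillna(words)  # fall back to the original word
print(result.tolist())  # ['bro', 'chill', 'nice']
```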

score 0 · Answer 3 · answered Jul 17 '21 at 18:11

df['WORDS'] = df['WORDS'].apply(lambda x: d[x])

This will do the job.

Diyar Mohammady


This doesn't work if there is a word in the df that is not in the dictionary. It causes a key error. – Tomas Storås Jul 17 '21 at 18:53
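
The KeyError can be avoided with dict.get, which takes a default to return when the key is missing. A minimal sketch, assuming unknown words should pass through untranslated; the "nice" row is sample data.

```python
import pandas as pd

d = {"cool": "chill", "guy": "dude", "cool guy": "bro"}
# "nice" is a sample word that has no key in the dictionary.
df = pd.DataFrame({"WORDS": ["cool guy", "guy", "nice"]})

# d.get(x, x) returns the translation, or x itself when x is not a key.
df["WORDS"] = df["WORDS"].apply(lambda x: d.get(x, x))
print(df["WORDS"].tolist())  # ['bro', 'dude', 'nice']
```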
