Replacing strings in a a column of similar categories, map to new column in python

Question

I have an existing dataframe (coffee_directions_df) that looks like the following

coffee_directions_df

Utterance                         Frequency   

Directions to Starbucks           1045
Directions to Tullys              1034
Give me directions to Tullys      986
Directions to Seattles Best       875
Show me directions to Dunkin      812
Directions to Daily Dozen         789
Show me directions to Starbucks   754
Give me directions to Dunkin      612
Navigate me to Seattles Best      498
Display navigation to Starbucks   376
Direct me to Starbucks            201

The DF shows utterances made by people and the frequency of utterances.

I.e., "Directions to Starbucks" was uttered 1045 times.

I am trying to figure out how to replace similar words, such as "Starbucks", "Tullys", "Seattles Best" in the column coffee_directions_df.Utterance to one string, such as "Coffee". I've seen similar answers that suggest a dictionary, such as the following, but I haven't had success yet.

{'Utterance':['Starbucks','Tullys','Seattles Best'],
      'Combi_Utterance':['Coffee','Coffee','Coffee','Coffee']}

{'Utterance':['Dunkin','Daily Dozen'],
      'Combi_Utterance':['Donut','Donut']}

{'Utterance':['Give me','Show me','Navigate me','Direct me'],
      'Combi_Utterance':['V_me','V_me','V_me','V_me']}

The desired output is as the following:

coffee_directions_df

Utterance                         Frequency  Combi_Utterance
Directions to Starbucks           1045       Directions to Coffee
Directions to Tullys              1034       Directions to Coffee
Give me directions to Tullys      986        V_me to Coffee
Directions to Seattles Best       875        Directions to Coffee
Show me directions to Dunkin      812        V_me to Donut
Directions to Daily Dozen         789        Directions to Donut
Show me directions to Starbucks   754        V_me to Coffee
Give me directions to Dunkin      612        V_me to Donut
Navigate me to Seattles Best      498        V_me to Coffee
Display navigation to Starbucks   376        Display navigation to Coffee
Direct me to Starbucks            201        V_me to Coffee

Ultimately, I want to be able to use this code that I have to produce the final output.

df = (df.set_index('Frequency')['Utterance']
        .str.split(expand=True)
        .stack()
        .reset_index(name='Words')
        .groupby('Words', as_index=False)['Frequency'].sum()
        )

print (df)
         Words  Frequency
0   Directions       6907
1         V_me       3863
2        Donut       2213
3       Coffee       5769
4        Other        376

Thanks!!

score 1 · Accepted Answer · answered Mar 31 '18 at 18:50

Below is one way. As per your previous questions, I choose to use collections.Counter instead of pandas for your counting logic.

The input required is in the form of a mapping dictionary rep_dict. We apply this to substrings of strings within the df['Utterance'] series.

from collections import Counter
import pandas as pd

df = pd.DataFrame([['Directions to Starbucks', 1045],
                   ['Show me directions to Starbucks', 754],
                   ['Give me directions to Starbucks', 612],
                   ['Navigate me to Starbucks', 498],
                   ['Display navigation to Starbucks', 376],
                   ['Direct me to Starbucks', 201],
                   ['Navigate to Starbucks', 180]],
                  columns=['Utterance', 'Frequency'])

# define dictionary of mappings
rep_dict = {'Starbucks': 'Coffee', 'Tullys': 'Coffee', 'Seattles Best': 'Coffee'}

# apply substring mapping
df['Utterance'] = df['Utterance'].replace(rep_dict, regex=True).str.lower()

# previous logic below
c = Counter()

for row in df.itertuples():
    for i in row[1].split():
        c[i] += row[2]

res = pd.DataFrame.from_dict(c, orient='index')\
                  .rename(columns={0: 'Count'})\
                  .sort_values('Count', ascending=False)

def add_combinations(df, lst):
    for i in lst:
        words = '_'.join(i)
        df.loc[words] = df.loc[df.index.isin(i), 'Count'].sum()
    return df.sort_values('Count', ascending=False)

lst = [('give', 'show', 'navigate', 'direct')]

res = add_combinations(res, lst)

Result

                           Count
to                          3666
coffee                      3666
directions                  2411
give_show_navigate_direct   2245
me                          2065
show                         754
navigate                     678
give                         612
display                      376
navigation                   376
direct                       201

hi, please take a look at my next question if you have some time. always appreciate your help! (follows your steps but i'm trying to do something else also). thanks! https://stackoverflow.com/questions/49656991/importing-txt-file-to-replace-certain-strings-in-a-dataframe-pandas — user_seaweed, Apr 04 '18 at 17:27
@user_seaweed, please accept this answer if it works (green tick on left). — jpp, Apr 04 '18 at 17:30
thanks, sorry still new at this. basically looking for a new way to do this with a bigger data frame. (so ideally, would like to import rep_dict as a txt file rather than writing it all out in my shell/terminal). Thanks! — user_seaweed, Apr 05 '18 at 01:11
@user_seaweed, please see [What should I do when someone answers my question?](https://stackoverflow.com/help/someone-answers) — jpp, Apr 05 '18 at 08:08

Replacing strings in a a column of similar categories, map to new column in python

1 Answers1