
Example of the dataframe:

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'], 
'Days_Sold': [3, 4, 1, 2, 2, 1]}

cake = pd.DataFrame(data=cup)

cake
  1. I calculated the most common words of the dataframe (with stop words removed)

    from collections import Counter
    
    Counter(" ".join(cake['Description']).split()).most_common()
    
  2. I put this into a new dataframe and reset the index

    count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())
    
    count.columns = ['Words', 'Values']
    
    count.index= np.arange(1, len(count)+1)
    
    count.head()
    
  3. The Values column lives in the 'count' dataframe, and Days_Sold lives in the 'cake' dataframe. What I would like to do now: for each common word in the 'count' dataframe (e.g. cupcake), find how long it takes to sell the matching products by summing Days_Sold over every row of the 'cake' dataframe whose description contains that word, repeating this for every common word until it's done. For cupcake the answer should be 3 + 4 + 1 = 8.

My actual dataframe is over 3,000 rows (and not actually about cakes), and the descriptions are longer. I need over 40 common words, adjustable to my needs.

This is why I can't type each word in by hand. I believe this requires a nested for loop, but I am stuck on it.

for day in cake:

    for top in count:

       top= count.Words

    day= cake.loc[cake['CleanDescr'] == count, ['Days_Sold']]

The error says: 'int' object is not iterable
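For reference, a minimal working version of that nested loop could look like the sketch below, using the toy `cake` frame from the question (no stop-word removal; `totals` is a name introduced here for illustration):

```python
import pandas as pd
from collections import Counter

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake',
                       'strawberry cookie', 'grape organic cookie',
                       'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)

count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common(),
                     columns=['Words', 'Values'])

totals = {}
for word in count['Words']:                # outer loop: each common word
    total = 0
    # inner loop: each description / days pair in cake
    for desc, days in zip(cake['Description'], cake['Days_Sold']):
        if word in desc.split():           # the word appears in this description
            total += days
    totals[word] = total

print(totals['cupcake'])  # 3 + 4 + 1 = 8
```

The `count['Words']` series can be truncated (e.g. `count['Words'].head(40)`) to keep only as many common words as needed.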

Thank you!

Update:

Thank you so much to everyone helping me on this large project. I am posting my solution to #3, adjusted from the answer by Mark Moretto.

# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()
df

# Merge counts to main DataFrame
df_freq = pd.merge(df, count.rename(columns={"Words": "Description"}), on="Description")
df_freq

# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on = "index", right_index = True)
            .loc[:, ["Description_x", "Values", "Days_Sold"]]
            .rename(columns={"Description_x": "Description"})
            )
df_freq

# Group by Description and take the mean of the value fields
df_metrics = df_freq.groupby("Description").mean().round(4)
df_metrics

df_metrics.sort_values(by='Values', ascending=False).head(5)
#print(df_metrics)
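For comparison, the per-word average of Days_Sold can also be reproduced without the merge steps. This is only a sketch against the toy `cake` frame from the question (`mean_days` is a name introduced here):

```python
import pandas as pd

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake',
                       'strawberry cookie', 'grape organic cookie',
                       'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
cake = pd.DataFrame(data=cup)

# One row per (row, word) pair; explode keeps the original row index.
words = cake['Description'].str.split().explode()

# Align each word with its row's Days_Sold, then average per word.
days = cake['Days_Sold'].reindex(words.index)
mean_days = days.groupby(words.values).mean()

print(mean_days['cupcake'])  # (3 + 4 + 1) / 3
```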

2 Answers


Given

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'], 
'Days_Sold': [3, 4, 1, 2, 2, 1]}

cake = pd.DataFrame(data=cup)

count = pd.DataFrame(Counter(" ".join(cake['Description']).split()).most_common())

count.columns = ['Words', 'Values']

count.index= np.arange(1, len(count)+1)

Your final count dataframe looks like:


    Words       Values
1   cupcake     3
2   cookie      3
3   organic     3
4   strawberry  2
5   blueberry   2
You can:

  1. Convert the index to be a column, see How to convert index of a pandas dataframe into a column

  2. Then, reindex your count dataframe by Words

  3. Finally, you can use .loc[<key>]['Values'] to get the no. of days

count_by_words = count.reset_index().set_index('Words')
count_by_words.loc['cupcake']['Values']

The count_by_words DataFrame will look like this:


            index  Values
Words
cupcake         1       3
cookie          2       3
organic         3       3
strawberry      4       2
blueberry       5       2
grape           6       1
lemon           7       1

If the goal is to estimate max days sold based on the words in the description, you can try:

import pandas as pd
import numpy as np

from collections import Counter, defaultdict

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake', 'strawberry cookie', 'grape organic cookie', 'blueberry organic cookie', 'lemon organic cupcake'], 
'Days_Sold': [3, 4, 1, 2, 2, 1]}

df = pd.DataFrame(data=cup)

word_counter = Counter() # Keeps track of the word count
word_days = defaultdict(list) # Keeps track of the max days sold
max_days = {}

# Iterate each row at a time.
for _, s in df.iterrows():
    words = s['Description'].split()
    word_counter += Counter(words)
    for word in words:
        # Compare against the days recorded so far, before appending this row's
        # value; the default must be a list so that max() works on a new word.
        if max(word_days.get(word, [0])) < s['Days_Sold']:
            # Set the max_days for the word as the current Days_Sold
            max_days[word] = s['Days_Sold']
        # Keep track of the different Days_Sold values seen for this word.
        word_days[word].append(s['Days_Sold'])
            
df2 = pd.DataFrame({'max_days_sold': max_days, 'word_count': word_counter})

df2.loc['strawberry']['max_days_sold']

[out]:

            max_days_sold  word_count
strawberry              3           2
cupcake                 4           3
blueberry               4           2
cookie                  2           3
grape                   2           1
organic                 2           3
lemon                   1           1
alvas
  • Hi, Thank you for the help. However, the Values isn't the day. The Values is in the count dataframe. The Days_Sold are in the cake dataframe. I'm trying to figure out that if, for example, the word "cookie" shows up, it would take on average how many days to be sold. The answer should be 5. I believe this requires a for loop. I will clarify in my question. Thank you again. – Kitty.Cattie Aug 30 '21 at 11:45
  • Thank you so much Alvas!!! I was able to run this! In this case, how should I do an average or a median in this loop? – Kitty.Cattie Aug 30 '21 at 12:32
  • By keeping track of the `word_days` object you should be able to get what you need. Hint: Change at `if max(word_days.get(word, 0)) < s['Days_Sold']:` – alvas Aug 30 '21 at 12:41
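Following that hint, a sketch that keeps the per-word `word_days` lists and derives the average or median from them (using the toy frame from the question; `avg_days` and `med_days` are names introduced here) could be:

```python
import pandas as pd
from collections import defaultdict
from statistics import mean, median

cup = {'Description': ['strawberry cupcake', 'blueberry cupcake',
                       'strawberry cookie', 'grape organic cookie',
                       'blueberry organic cookie', 'lemon organic cupcake'],
       'Days_Sold': [3, 4, 1, 2, 2, 1]}
df = pd.DataFrame(data=cup)

# Collect every Days_Sold value seen for each word.
word_days = defaultdict(list)
for _, s in df.iterrows():
    for word in s['Description'].split():
        word_days[word].append(s['Days_Sold'])

avg_days = {w: mean(d) for w, d in word_days.items()}
med_days = {w: median(d) for w, d in word_days.items()}

print(avg_days['cupcake'])  # (3 + 4 + 1) / 3
```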

Another way, though I didn't really remove any words by frequency or anything.

# <...your starter code for dataframe creation...>

# Split and explode Description
df = cake.iloc[:, 0].str.lower().str.split(r"\W+").explode().reset_index()

# Get count of words
df_counts = (df.groupby("Description")
            .size()
            .reset_index()
            .rename(columns={0: "word_count"})
            )

# Merge counts to main DataFrame
df_freq = pd.merge(df, df_counts, on="Description")

# Left join cake DataFrame onto df_freq by index values.
df_freq = (pd.merge(df_freq, cake, left_on = "index", right_index = True)
            .loc[:, ["Description_x", "word_count", "Days_Sold"]]
            .rename(columns={"Description_x": "Description"})
            )

# Group by Description and return max result for value fields
df_metrics = df_freq.groupby("Description").max()

print(df_metrics)

Output:

             word_count  Days_Sold
Description
blueberry             2          4
cookie                3          2
cupcake               3          4
grape                 1          2
lemon                 1          1
organic               3          2
strawberry            2          3
Mark Moretto
  • Hi Mark, thank you so much for helping me. Your code actually works really well. I was able to calculate the mean with this and sort_values in the cake data. I took out the count in the #Get count words, but it was so nice to see that there was another way to calculate the count. Thank you again. I will post my solution in the description and cite you. – Kitty.Cattie Aug 30 '21 at 23:52