I'm trying to count the occurrences of multiple keywords within each phrase of a dataframe. This seems similar to other questions, but not quite the same.

Here we have a df and a list of lists containing keywords/topics:

import pandas as pd
df=pd.DataFrame({'phrases':['very expensive meal near city center','very good meal and waiters','nice restaurant near center and public transport']})

topics=[['expensive','city'],['good','waiters'],['center','transport']]

For each phrase, we want to count how many words match in each separate topic. So the first phrase should score 2 for the 1st topic, 0 for the 2nd topic and 1 for the 3rd topic, etc.

I've tried this but it does not work:

from collections import Counter
topnum=0
for t in topics:
    counts=[]
    topnum+=1
    results = Counter()
    for line in df['phrases']:
        for c in line.split(' '):
            results[c] = t.count(c)
        counts.append(sum(results.values()))
    df['topic_'+str(topnum)] = counts

I'm not sure what I'm doing wrong. Ideally I would end up with a count of matching words for each topic/phrase combination, but instead the counts seem to repeat themselves:

phrases                                            topic_1  topic_2     topic_3
very expensive meal near city centre              2             0           0
very good meal and waiters                        2             2           0
nice restaurant near center and public transport  2             2           2

Many thanks to whoever can help me. Best Wishes

tezzaaa
  • Always provide a complete [mre] with code, **data, errors, current output, and expected output**, as **[formatted text](https://stackoverflow.com/help/formatting)**. If relevant, only plot images are okay. Please see [How to ask a good question](https://stackoverflow.com/help/how-to-ask). Provide data with [How to provide a reproducible copy of your DataFrame using `df.head(15).to_clipboard(sep=',')`](https://stackoverflow.com/questions/52413246), then **[edit] your question**, and paste the clipboard into a code block. – Trenton McKinney Feb 09 '21 at 21:55

1 Answer

Here is a solution that defines a helper function called find_count and applies it to each row of the dataframe through a lambda, once per topic.

import pandas as pd
df=pd.DataFrame({'phrases':['very expensive meal near city center','very good meal and waiters','nice restaurant near center and public transport']})
topics=[['expensive','city'],['good','waiters'],['center','transport']]

def find_count(row, topics_index):
    count = 0
    word_list = row['phrases'].split()
    for word in word_list:
        if word in topics[topics_index]:
            count+=1
    return count

df['Topic 1'] = df.apply(lambda row:find_count(row,0), axis=1)
df['Topic 2'] = df.apply(lambda row:find_count(row,1), axis=1)
df['Topic 3'] = df.apply(lambda row:find_count(row,2), axis=1)

print(df)

#Output
                                            phrases  Topic 1  Topic 2  Topic 3
0              very expensive meal near city center        2        0        1
1                        very good meal and waiters        0        2        0
2  nice restaurant near center and public transport        0        0        2
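
The counts in the question repeat because `results` is created once per topic and is never cleared between phrases, so matches from earlier phrases stay in the total. Counting each phrase on its own avoids that. With many topics (the comments mention 60), the three per-topic lines can also be folded into a single loop. What follows is a minimal sketch, assuming whitespace tokenization and exact, case-sensitive matches; the helper name `count_matches` and the `topic_N` column names are illustrative and not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({'phrases': [
    'very expensive meal near city center',
    'very good meal and waiters',
    'nice restaurant near center and public transport',
]})
topics = [['expensive', 'city'], ['good', 'waiters'], ['center', 'transport']]

def count_matches(phrase, keywords):
    # Hypothetical helper: count the whitespace-separated words of one phrase
    # that appear in keywords. Nothing is carried over between phrases.
    return sum(word in keywords for word in phrase.split())

for i, topic in enumerate(topics, start=1):
    keywords = set(topic)  # set lookup instead of scanning the list per word
    # One new column per topic, numbered from 1.
    df[f'topic_{i}'] = [count_matches(p, keywords) for p in df['phrases']]

print(df)
```

Repeated words in a phrase are counted each time they occur, which matches the behaviour of find_count. If punctuation or capitalisation matters in the real data, the phrases would need normalising first.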
pakpe
  • Sorry, I'm actually not getting the same output as you: when I run your code, topic 3 gets 0 for the first phrase instead of the correct 1. – tezzaaa Feb 09 '21 at 22:41
  • Make sure you copy and paste my code exact. The output I printed is the output of this code. Did you alter it in some way? – pakpe Feb 09 '21 at 22:50
  • Ah yes, big sorry, it's working now, I must have done something wrong! I think I tried to run it in a loop, as I in fact have thousands of phrases and 60 topics. – tezzaaa Feb 09 '21 at 22:53
  • all good now, I added: topic_n=0 for each in topics: topic_n+=1 df['topic'+str(topic_n)]=df.apply(lambda row:find_count(row,topic_n-1), axis=1) df.head() – tezzaaa Feb 09 '21 at 22:56
  • Good. You had me worried for a second. LOL. – pakpe Feb 09 '21 at 22:59