I have a pandas DataFrame with multiple rows and columns, a sample of which is shown below:
DocName | Content |
---|---|
Doc1 | Hi how you are doing ? Hope you are well. I hear the food is great! |
Doc2 | The food is great. James loves his food. You not so much right ? |
Doc3 | Yeah he is alright. |
I also have a list of 100 words, as follows:
words = ["food", "you", ....]
Now, for each word in the list, I need to extract the top N rows in which that word occurs most frequently in the "Content" column. For the given sample of data (ignoring case):
"food" occurs twice in Doc2 and once in Doc1.
"you" occurs twice in Doc1 and once in Doc2.
Hence, the desired output is:
{food: [Doc2, Doc1], you: [Doc1, Doc2], .....}
where N = 2 (the top 2 rows with the most frequent occurrences of each word).
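I have tried the following, but I am unsure how to proceed:
words = ["food", "you"]  # .... plus the rest of the 100 words
result = []
for word in words:
    # Count occurrences of the word in each row of "Content" (case-sensitive substring match)
    result.append(df["Content"].apply(lambda text: text.count(word)))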
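For reference, here is a minimal sketch of the result I am after, written in plain Python rather than pandas. The function name `top_docs_per_word` and the `sample` data layout are my own placeholders, and it assumes case-insensitive whole-word matching, unique document names, and that ties keep the original row order. It tokenizes each document once instead of rescanning the text for every word, but I do not know whether this is the most efficient approach:

```python
import re
from collections import Counter

def top_docs_per_word(docs, words, n=2):
    """Map each word to the names of the n documents that contain it most often.

    docs  : iterable of (doc_name, content) pairs (doc names assumed unique)
    words : iterable of words to look up
    Assumption: matching is case-insensitive and on whole words only.
    """
    # Tokenize every document once and store its word frequencies.
    counts = {
        name: Counter(re.findall(r"\w+", content.lower()))
        for name, content in docs
    }
    result = {}
    for word in words:
        key = word.lower()
        # Keep only the documents that contain the word at all.
        hits = [(name, c[key]) for name, c in counts.items() if c[key] > 0]
        # Highest count first; the sort is stable, so ties keep input order.
        hits.sort(key=lambda pair: pair[1], reverse=True)
        result[word] = [name for name, _ in hits[:n]]
    return result

# Sample rows from the table above, as (DocName, Content) pairs.
sample = [
    ("Doc1", "Hi how you are doing ? Hope you are well. I hear the food is great!"),
    ("Doc2", "The food is great. James loves his food. You not so much right ?"),
    ("Doc3", "Yeah he is alright."),
]

print(top_docs_per_word(sample, ["food", "you"], n=2))
# {'food': ['Doc2', 'Doc1'], 'you': ['Doc1', 'Doc2']}
```

With the DataFrame, the pairs could presumably be supplied as `zip(df["DocName"], df["Content"])`.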
How can I implement an efficient solution to the above requirement in Python?