
I have a pandas dataframe with multiple rows and columns; a sample is shown below -

DocName Content
Doc1 Hi how you are doing ? Hope you are well. I hear the food is great!
Doc2 The food is great. James loves his food. You not so much right ?
Doc3 Yeah he is alright.

I also have a list of 100 words as follows -

words = ["food", "you", ...]

Now, I need to extract the top N rows with the most frequent occurrences of each word from the list in the "Content" column. For the given sample of data,

"food" occurs twice in Doc2 and once in Doc1.

"you" occurs twice in Doc1 and once in Doc2.

Hence, the desired output is:

{food: [Doc2, Doc1], you: [Doc1, Doc2], ...}

where N = 2 (top 2 rows having the most frequent occurrence of each word)

I have tried the following but am unsure how to proceed -

words = ["food", "you", ...]
result = []

for word in words:
    # Count occurrences of the word in each row's Content
    result.append(df.Content.apply(lambda row: row.count(word)))

How can I implement an efficient solution to the above requirement in Python?

dravid07
  • Please show some of your own effort to implement a solution. Stack Overflow is not a code writing service. – Frodnar Nov 25 '22 at 03:30
  • Have updated the question with my initial attempt. – dravid07 Nov 25 '22 at 04:50
  • [This SO answer](https://stackoverflow.com/questions/67385186/counting-the-occurrence-of-specific-words-in-a-dataframe) may point you in the right direction. – Dash Nov 25 '22 at 05:23

2 Answers


Second attempt (initially I misunderstood your requirements): with `df` as your dataframe, you could try something like this:

words = ["food", "you"]
n = 2  # Number of top docs
res = (
    df
    .assign(Content=df["Content"].str.casefold().str.findall(r"\w+"))
    .explode("Content")
    .loc[lambda df: df["Content"].isin(set(words))]
    .groupby("DocName").value_counts().rename("Counts")
    .sort_values(ascending=False).reset_index(level=0)
    .assign(DocName=lambda df: df["DocName"] + "_" + df["Counts"].astype("str"))
    .groupby(level=0).agg({"DocName": list})
    .assign(DocName=lambda df: df["DocName"].str[:n])
    .to_dict()["DocName"]
)
  • The first 3 lines in the pipeline extract the relevant words, one per row. For the sample that looks like:

       DocName Content
    0    Doc1     you
    0    Doc1     you
    0    Doc1    food
    1    Doc2    food
    1    Doc2    food
    1    Doc2     you
    
  • The next 3 lines count the words per doc (.groupby and .value_counts), sort the result by the counts in descending order (.sort_values), and append the count to the doc strings (.assign). For the sample:

             DocName  Counts
    Content                
    you      Doc1_2       2
    food     Doc2_2       2
    food     Doc1_1       1
    you      Doc2_1       1
    
  • Then .groupby the words (index), put the respective docs in a list via .agg, and restrict each list to the first n items (.str[:n]). For the sample:

                       DocName
    Content                  
    food     [Doc2_2, Doc1_1]
    you      [Doc1_2, Doc2_1]
    
  • Finally, the result is dumped into a dictionary (.to_dict()).

Result for the sample dataframe

  DocName                                                              Content
0    Doc1  Hi how you are doing ? Hope you are well. I hear the food is great!
1    Doc2  The food is great. James loves his food. You not so much right ?
2    Doc3  Yeah he is alright.

is

{'food': ['Doc2_2', 'Doc1_1'], 'you': ['Doc1_2', 'Doc2_1']}
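For reference, the sample dataframe can be rebuilt with the snippet below; the `\bfood\b` sanity check at the end is an illustrative addition, not part of the pipeline above:

```python
import pandas as pd

# Rebuild the sample dataframe from the question
df = pd.DataFrame({
    "DocName": ["Doc1", "Doc2", "Doc3"],
    "Content": [
        "Hi how you are doing ? Hope you are well. I hear the food is great!",
        "The food is great. James loves his food. You not so much right ?",
        "Yeah he is alright.",
    ],
})

# Sanity check: per-row counts of "food" match the question's description
print(df["Content"].str.casefold().str.count(r"\bfood\b").tolist())  # [1, 2, 0]
```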
Timus
  • This is amazing Timus, thank you so much. If I were to attach the counts along with the doc name, how can I do that? For example - {'food': ['Doc2_2', 'Doc1_1'], 'you': ['Doc1_2', 'Doc2_1']} – dravid07 Nov 27 '22 at 23:19
  • @dravid07 See the edit (one additional line after resetting the index is adding the counts to the `DocName` column). – Timus Nov 28 '22 at 08:49
  • Thanks a ton Timus. The concise Pythonic code is so beautiful to look at :) I hope I can be at that level some day haha !! – dravid07 Nov 28 '22 at 16:13
  • By the way, is there any way that I can connect with you Timus ? – dravid07 Nov 28 '22 at 16:15
  • @dravid07 Thanks for the feedback, I'm glad you like it. What do you mean by connect: Something like LinkedIn? – Timus Nov 28 '22 at 19:08
  • Yeah Linkedin would be nice. – dravid07 Nov 28 '22 at 20:39
  • @dravid07 I have an email address in my profile here. You could send me a link to your LinkedIn profile, if you like. – Timus Nov 29 '22 at 08:57

It seems like this problem can be broken down into two sub-problems:

  1. Get the frequency of words per "Content" cell
  2. For each word in the list, extract the top N rows

Luckily, the first sub-problem has many neat approaches, as shown here. TL;DR: use the `collections` library to do a frequency count; or, if you aren't allowed to import libraries, call `.split()` and count in a loop. But again, there are many potential solutions.
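A minimal sketch of the first sub-problem with `collections.Counter`; the sentence is taken from the sample data, and the punctuation stripping is illustrative rather than exhaustive:

```python
from collections import Counter

content = "The food is great. James loves his food. You not so much right ?"

# Lowercase each token and strip common trailing punctuation
tokens = [t.strip(".,!?").lower() for t in content.split()]
freq = Counter(tokens)

print(freq["food"])  # 2
print(freq["you"])   # 1
```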

The second sub-problem is a bit trickier. From the first sub-problem, what we have now is a dictionary of frequency counts per row. To get to the desired answer, the naive method would be to "query" every dictionary for the word in question, e.g. run

doc1.dict["food"]
doc2.dict["food"]
...

and compare the results in order.
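That ranking step can be sketched with `heapq.nlargest`; the `counts` dictionary and the `top_n_docs` helper below are made up for illustration (they stand in for the per-document output of sub-problem 1):

```python
from heapq import nlargest

# Hypothetical per-document frequency counts (output of sub-problem 1)
counts = {
    "Doc1": {"food": 1, "you": 2},
    "Doc2": {"food": 2, "you": 1},
    "Doc3": {},
}

def top_n_docs(word, counts, n=2):
    # Keep only documents containing the word, then take the n highest counts
    present = [(doc, c[word]) for doc, c in counts.items() if c.get(word, 0) > 0]
    return [doc for doc, _ in nlargest(n, present, key=lambda pair: pair[1])]

print(top_n_docs("food", counts))  # ['Doc2', 'Doc1']
print(top_n_docs("you", counts))   # ['Doc1', 'Doc2']
```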

That should be enough to get going, with plenty of opportunity to find more streamlined/elegant solutions. Best of luck!

ZeThey