1

I have a data frame in which each row represents a customer message. I want a data frame of document frequencies: for each word, the number of documents (messages) that contain that word. How can I get that?

For example, I have this

DATAFRAME A

customer  message
A         hi i need help i want a card
B         i want a card

The output I want is:

DATAFRAME B
word  document_frequency
hi      1
i       2 --> 2 documents contain "i", regardless of how many times it appears in each document
need    1 
help    1
want    2
a       2
card    2

What I have so far is the tokenized messages and the frequency of each word (the number of times the word appears, not the number of documents that contain it). The tokenized messages look like this:

0  [hi, i, need, help, i, want, a, card]
1  [i, want, a, card]

And the frequency of each word is a data frame like this:

DATAFRAME C
word  frequency
hi      1
i       3 --> word "i" appears 3 times
need    1 
help    1
want    2
a       2
card    2
Emma
rebar

4 Answers

2

From your original DataFrame, set the index, split the strings, explode, and reset the index. This puts each word in its own row, and the index manipulation keeps track of the 'customer' each word came from.

Then use drop_duplicates so that each word is counted only once per 'customer', and groupby + size to count the documents.

import pandas as pd
df = pd.DataFrame({'customer': ['A', 'B'], 
                   'message': ['hi i need help i want a card', 'i want a card']})

(df.set_index('customer')['message'].str.split().explode()
   .reset_index()
   .drop_duplicates()
   .groupby('message').size()
)

message
a       2
card    2
help    1
hi      1
i       2
need    1
want    2
dtype: int64
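
If you need the result shaped like DATAFRAME B from the question, one possible way is to name the index and turn the Series back into a DataFrame. This is only a sketch: the column names `word` and `document_frequency` are taken from the question, and `doc_freq` and `result` are names I made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'B'],
                   'message': ['hi i need help i want a card', 'i want a card']})

# Same pipeline as above: one row per (customer, word), duplicates dropped
doc_freq = (df.set_index('customer')['message'].str.split().explode()
              .reset_index()
              .drop_duplicates()
              .groupby('message').size())

# Name the index 'word' and move it into a regular column
result = doc_freq.rename_axis('word').reset_index(name='document_frequency')
print(result)
```

The rows come out sorted alphabetically by word because groupby sorts its keys by default.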

If you start from that Series, s, of lists with tokens, then do: s.explode().reset_index()...
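
As a rough sketch of that variant, assuming `s` is a Series of token lists with the default integer index (so the index itself identifies the document): I rename the Series to `word` here purely so the exploded column has a readable name. That name is my choice, not something from the question.

```python
import pandas as pd

# Assumed input: one list of tokens per message
s = pd.Series([['hi', 'i', 'need', 'help', 'i', 'want', 'a', 'card'],
               ['i', 'want', 'a', 'card']])

doc_freq = (s.rename('word')
             .explode()           # one row per token, index repeats per document
             .reset_index()       # columns: 'index' (document id) and 'word'
             .drop_duplicates()   # keep each word once per document
             .groupby('word').size())
print(doc_freq)
```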

ALollz
  • It returns "'Series' object has no attribute 'explode'" – rebar Jun 08 '20 at 20:01
  • @RenataBiaggi you must be using an older version of `pandas`. That method was added in `0.25` so if you can, I suggest you update to a more recent version (currently it's at `1.0.4`). If that's not possible for whatever reason then you can look to: https://stackoverflow.com/questions/53218931/how-to-unnest-explode-a-column-in-a-pandas-dataframe for a workaround in earlier versions. – ALollz Jun 08 '20 at 20:33
0

You should be able to use a collections.Counter for something like this:

from pandas import DataFrame
from collections import Counter

df = DataFrame([{'customer': 'A', 'message': 'hi i need help i want a card'}, {'customer': 'B', 'message': 'i want a card'}])
#   customer                       message
# 0        A  hi i need help i want a card
# 1        B                 i want a card
counter = Counter()
for row in df['message']:
    counter.update(row.split())
print(counter)
# Counter({'i': 3, 'want': 2, 'a': 2, 'card': 2, 'hi': 1, 'need': 1, 'help': 1})

If you want to be able to do it case-insensitively, you can throw a .lower() in there between row and .split().

Green Cloak Guy
  • This one is giving me what I already have, which I called Dataframe C. What I want is Dataframe B – rebar Jun 08 '20 at 19:51
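
To get Dataframe B with a Counter instead, one possible adjustment (a sketch, not part of the answer as posted) is to deduplicate the tokens of each message with `set()` before updating the counter, so every word contributes at most 1 per document. The names `doc_counter` and `result` are made up for illustration.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'customer': ['A', 'B'],
                   'message': ['hi i need help i want a card', 'i want a card']})

doc_counter = Counter()
for message in df['message']:
    # set() removes repeats within a message, so each word counts once per document
    doc_counter.update(set(message.split()))

# Optional: turn the counts into a DataFrame with the column names from the question
result = pd.DataFrame(sorted(doc_counter.items()),
                      columns=['word', 'document_frequency'])
print(result)
```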
0

This would be the Python 101 looping answer, without using some of the more powerful pandas functions.

df['word'] = df['message'].str.split(' ')
df1 = pd.DataFrame({'word' :[]})
word_list = []
for sentence in df['word']:
    for word in sentence:
        word_list.append(word)
df1['word'] = word_list
C = pd.DataFrame(df1.groupby('word')['word'].count()).rename({'word' : 'document_frequency'}, axis=1).reset_index()
C

Output:

    word    document_frequency
0   a       2
1   card    2
2   help    1
3   hi      1
4   i       3
5   need    1
6   want    2
David Erickson
  • This one is giving me what I already have, which I called Dataframe C. What I want is Dataframe B – rebar Jun 08 '20 at 19:49
0

You can use str.get_dummies to flag whether each word appears in a row, then sum the flags:

res = df['message'].str.get_dummies(sep=' ').sum()
print(res)
a       2
card    2
help    1
hi      1
i       2
need    1
want    2
dtype: int64
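
To see why this gives a document frequency rather than a raw count, it may help to look at the intermediate result. A small sketch, where `indicators` is a name chosen for illustration: str.get_dummies produces one 0/1 column per word, so a word repeated within a message is still recorded as 1, and the column sums count messages.

```python
import pandas as pd

df = pd.DataFrame({'customer': ['A', 'B'],
                   'message': ['hi i need help i want a card', 'i want a card']})

# One row per message, one 0/1 column per distinct word
indicators = df['message'].str.get_dummies(sep=' ')
print(indicators)

# Summing each column counts the messages that contain that word
res = indicators.sum()
print(res)
```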
Ben.T