I have a data frame in which each row represents a customer message. I want to build a data frame of document frequencies: for each word, the number of documents (messages) that contain it. How can I get that?
For example, I have this
DATAFRAME A
customer message
A hi i need help i want a card
B i want a card
The output I want is:
DATAFRAME B
word document_frequency
hi 1
i 2 --> 2 documents contain "i", regardless of how many times it appears in each document
need 1
help 1
want 2
a 2
card 2
What I have so far is the tokenized messages and the total frequency of each word (the number of times the word appears across all documents, not the number of documents that contain it). The tokenized messages look like this:
0 [hi, i, need, help, i, want, a, card]
1 [i, want, a, card]
And the frequency of each word is a data frame like this:
DATAFRAME C
word frequency
hi 1
i 3 --> word "i" appears 3 times
need 1
help 1
want 2
a 2
card 2
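One possible way to get from the tokenized messages to DATAFRAME B is to remove repeated words within each message before counting, so that each document contributes at most 1 per word. The following is only a minimal sketch: the column names `customer` and `message`, the variable names, and the whitespace tokenization are assumptions made to reproduce the example above, not the actual tokenizer used.

```python
import pandas as pd

# Assumed reconstruction of DATAFRAME A from the example
df_a = pd.DataFrame({
    "customer": ["A", "B"],
    "message": ["hi i need help i want a card", "i want a card"],
})

# Assumption: simple whitespace tokenization, one list of words per message
tokens = df_a["message"].str.split()

# Total frequency (DATAFRAME C): every occurrence is counted
term_freq = tokens.explode().value_counts()

# Document frequency (DATAFRAME B): drop repeats inside each message first,
# then flatten to one row per (message, word) and count the rows per word
doc_freq = (
    tokens.apply(lambda words: list(dict.fromkeys(words)))
    .explode()
    .value_counts()
    .rename_axis("word")
    .reset_index(name="document_frequency")
)

print(doc_freq)
```

With the two example messages, "i" gets a total frequency of 3 but a document frequency of 2, which is the difference between DATAFRAME C and DATAFRAME B. The row order of `doc_freq` follows the counts rather than the order in which words appear.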