
I have a dataframe with sentences in the first column, and I want to count the word frequencies across them:

Input:

Foo bar
bar example
lalala foo
example sentence foo

Output:

foo       3
bar       2
example   2
lalala    1
sentence  1

Is there a simple way to do this?

If not, how should I do it? I see two ways:

1. Append all the sentences into one huge string, then count the words somehow (seems very inefficient). Or
2. Split the column into multiple columns on spaces (" ") (I know there's a package for that, but I can't remember which one), then rbind each column into one.
François M.

2 Answers


As in your second approach, we can split the column on spaces (" ") and then use table to count the frequency of each word. Since the expected output appears to be case-insensitive, we convert the column to lower case before splitting.

Assuming your dataframe is called df and the target column is V1.

table(unlist(strsplit(tolower(df$V1), " ")))

 #bar  example      foo   lalala sentence 
 #  2        2        3        1        1 

If this needs to be in a dataframe,

data.frame(table(unlist(strsplit(tolower(df$V1), " "))))

#      Var1 Freq
#1      bar    2
#2  example    2
#3      foo    3
#4   lalala    1
#5 sentence    1
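If the output should also be ordered by frequency, as in the question, the table can be sorted first. A small sketch, assuming the same df and V1 as above (df is rebuilt here from the question's input so the snippet is self-contained):

```r
# Reproducible input, as in the question
df <- data.frame(V1 = c("Foo bar", "bar example", "lalala foo",
                        "example sentence foo"))

# Sort the word-frequency table by decreasing count
sort(table(unlist(strsplit(tolower(df$V1), " "))), decreasing = TRUE)
```

Ties (here bar and example, both 2) keep no guaranteed order between themselves.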

EDIT

As per the OP's update in the comments: if there is a score column for each sentence, we need to sum it for every word.

Adding a reproducible example:

df <- data.frame(v1 = c("Foo bar", "bar example", "lalala foo","example sentence foo"), 
                 score = c(2, 3, 1, 4))
df

#                    v1 score
#1              Foo bar     2
#2          bar example     3
#3           lalala foo     1
#4 example sentence foo     4

One way of solving this is with the packages splitstackshape and dplyr. We convert each sentence into a long dataframe using cSplit, then summarise for every word, calculating both the frequency (n()) and the sum of the scores.

library(splitstackshape)
library(dplyr)
cSplit(df, "v1", sep = " ", direction = "long") %>%
      group_by(tolower(v1)) %>%
      summarise(Count = n(), 
                ScoreSum = sum(score))

#  tolower(v1) Count ScoreSum
#        (chr) (int)    (dbl)
#1         foo     3        7
#2         bar     2        5
#3     example     2        7
#4      lalala     1        1
#5    sentence     1        4

Or, using only the tidyverse:

library(tidyverse)

df %>%
  separate_rows(v1, sep = ' ') %>%
  group_by(v1 = tolower(v1)) %>%
  summarise(Count = n(), 
            ScoreSum = sum(score))
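For completeness, the same count-and-sum can be done in base R without extra packages. A sketch using the reproducible df from above: the intermediate long dataframe repeats each sentence's score once per word, then tapply aggregates per word.

```r
df <- data.frame(v1 = c("Foo bar", "bar example", "lalala foo",
                        "example sentence foo"),
                 score = c(2, 3, 1, 4))

# Split each sentence into words and repeat its score once per word
words <- strsplit(tolower(df$v1), " ")
long  <- data.frame(word  = unlist(words),
                    score = rep(df$score, lengths(words)))

# Count and sum per word (rows come out ordered alphabetically by word)
res <- data.frame(Count    = tapply(long$score, long$word, length),
                  ScoreSum = tapply(long$score, long$word, sum))
res
```

This gives the same numbers as the dplyr version, just with the words as row names.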
Ronak Shah
  • Thanks. What if I had another column in my input dataframe with a score for each sentence, and wanted to keep the score of each word (by computing the sum) ? – François M. Mar 13 '17 at 08:55

Try this:

library(stringr)
df$freq<-str_count(df$word,'\\w+')
Shenglin Chen
  • You need to add str_count(df$word,'\\w+')+1 as one word would return 0 – Akki Dec 21 '17 at 05:15
  • This just counts how many times the regex pattern occurs per value in the vector. They wanted the frequency of each word across the whole vector – camille Feb 12 '22 at 17:04