As in your second approach, we can split the column on spaces (" ") and then use table to count the frequency of each word. The output apparently should be case-insensitive, so the column is converted to lower case before splitting.
Assuming your dataframe is called df and the target column is V1:
table(unlist(strsplit(tolower(df$V1), " ")))
#bar example foo lalala sentence
# 2 2 3 1 1
If the result needs to be a dataframe, wrap the call in data.frame:
data.frame(table(unlist(strsplit(tolower(df$V1), " "))))
# Var1 Freq
#1 bar 2
#2 example 2
#3 foo 3
#4 lalala 1
#5 sentence 1
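The split on a single space assumes the words are separated by exactly one space. If the text may contain repeated or leading/trailing spaces, a slightly more defensive variant is to trim the strings and split on runs of whitespace. This is only a minimal sketch: the sample data (including the deliberate double space in the second sentence) is made up for illustration, and the names words and freq are not from the question.

```r
# Hypothetical sample data; the column name V1 follows the text above.
# Note the deliberate double space in "bar  example".
df <- data.frame(V1 = c("Foo bar", "bar  example", "lalala foo", "example sentence foo"),
                 stringsAsFactors = FALSE)

# Lower-case, trim the ends, then split on runs of whitespace
# instead of a single space, so no empty "" tokens are produced
words <- unlist(strsplit(trimws(tolower(df$V1)), "\\s+"))

# Tabulate the words and sort from most to least frequent
freq <- sort(table(words), decreasing = TRUE)
freq
```

With a plain " " separator the double space would produce an empty string that table counts as a word of its own; splitting on "\\s+" avoids that.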
EDIT
As per the OP's update in the comments, there is a score column for each sentence, and the scores need to be summed for every word.
Here is a reproducible example:
df <- data.frame(v1 = c("Foo bar", "bar example", "lalala foo", "example sentence foo"),
                 score = c(2, 3, 1, 4))
df
# v1 score
#1 Foo bar 2
#2 bar example 3
#3 lalala foo 1
#4 example sentence foo 4
One way to solve this is with the splitstackshape and dplyr packages. We convert each sentence into a long dataframe using cSplit, then group by word and summarise, calculating the frequency (n()) and the sum of score.
library(splitstackshape)
library(dplyr)
cSplit(df, "v1", sep = " ", direction = "long") %>%
  group_by(tolower(v1)) %>%
  summarise(Count = n(),
            ScoreSum = sum(score))
# tolower(v1) Count ScoreSum
# (chr) (int) (dbl)
#1 foo 3 7
#2 bar 2 5
#3 example 2 7
#4 lalala 1 1
#5 sentence 1 4
Or using only the tidyverse:
library(tidyverse)
df %>%
  separate_rows(v1, sep = ' ') %>%
  group_by(v1 = tolower(v1)) %>%
  summarise(Count = n(),
            ScoreSum = sum(score))
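If extra packages are not an option, the count and the score sum can probably also be obtained in base R. The following is a minimal sketch on the same example data, not a replacement for the approaches above; the names words, long, counts, sums and result are introduced here only for illustration.

```r
# Same example data as above
df <- data.frame(v1 = c("Foo bar", "bar example", "lalala foo", "example sentence foo"),
                 score = c(2, 3, 1, 4),
                 stringsAsFactors = FALSE)

# Split each lower-cased sentence into its words (a list, one element per row)
words <- strsplit(tolower(df$v1), " ")

# Long format: one row per word, with the sentence's score
# repeated once for every word in that sentence
long <- data.frame(word = unlist(words),
                   score = rep(df$score, lengths(words)),
                   stringsAsFactors = FALSE)

# Frequency of each word and sum of scores per word
counts <- table(long$word)
sums <- tapply(long$score, long$word, sum)

# Combine both, indexing counts by name so the rows line up
result <- data.frame(word = names(sums),
                     Count = as.vector(counts[names(sums)]),
                     ScoreSum = as.vector(sums),
                     stringsAsFactors = FALSE)
result
```

The key step is rep(df$score, lengths(words)), which repeats each score once per word so that it lines up with the unlisted words.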