
I have a huge data frame and want to count how often each word occurs in a specific column. For example:

Column
Hi my name is Corey!
Hi my name is John

Desired output:

Hi 2
my 2
name 2
is 2
Corey 1
John 1

I also want to exclude punctuation, such as the ! in Corey! in this example, as well as question marks, periods, and so on. Any help would be appreciated, thanks!

  • Sounds like what you're looking for is a [document-term matrix](https://en.wikipedia.org/wiki/Document-term_matrix)? – AkselA Oct 09 '18 at 00:22
  • As it sounds like you only have one "document," you may want [`tm`](https://cran.r-project.org/package=tm)'s `termFreq()` instead; here `tm::termFreq(x, control = list(removePunctuation = TRUE, wordLengths = c(1, Inf)))` (where you replace `x` with your vector, such as `df$word_column`) – duckmayr Oct 09 '18 at 00:37

1 Answer

df <- data.frame(column = c('Hi my name is Corey!',
  'Hi my name is John'))
df

#column
#1 Hi my name is Corey!
#2   Hi my name is John

all_words <- unlist( # flatten the per-string word lists into one vector
  regmatches(df$column,  gregexpr('\\w+', df$column))) # extract all words (runs of letters, digits, or underscores)
# count frequencies
freq_count <- table(all_words)
freq_count

#Corey    Hi    is  John    my  name 
#1     2     2     1     2     2 
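
If the counts are wanted as a data frame ordered by frequency, closer to the layout in the question, the same idea can be wrapped in a small function. This is only a sketch: `count_words` is a hypothetical helper name, and it swaps `\\w+` for `[[:alpha:]]+`, which assumes that only letters count as part of a word (so digits and underscores are dropped, and contractions such as "don't" would be split in two).

```r
# Hypothetical helper, not part of the original answer.
# Assumption: a "word" is a run of letters only.
count_words <- function(x) {
  x <- as.character(x)  # coerce in case the column is a factor
  # extract every run of letters from every string, then flatten
  words <- unlist(regmatches(x, gregexpr("[[:alpha:]]+", x)))
  # tabulate and put the most frequent words first
  counts <- sort(table(words), decreasing = TRUE)
  data.frame(word = names(counts),
             n = as.integer(counts),
             stringsAsFactors = FALSE)
}

df <- data.frame(column = c("Hi my name is Corey!",
                            "Hi my name is John"))
result <- count_words(df$column)
result
```

Matching is case-sensitive here, so "Hi" and "hi" would be counted separately; wrapping `x` in `tolower()` before extraction is one way to merge them.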
Yosi Hammer