I need to write a function that finds the most common word in a string of text, so that if I define a "word" as any sequence of letters, it returns the most common word.
For general purposes, it is better to use boundary("word") from stringr:
library(stringr)

most_common_word <- function(s){
  # table() tabulates the split words; the name of the maximum is the word itself
  which.max(table(s %>% str_split(boundary("word"))))
}
sentence <- "This is a very short sentence. It has only a few words: a, a. a"
most_common_word(sentence)
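Note that which.max() returns a named integer here (the word is the name, the value is its index in the table). If you want just the word as a character string, a small variant (a sketch, not part of the answer above; most_common_word_name is a made-up name) wraps the result in names():

```r
library(stringr)

# Variant that returns only the word itself;
# names() pulls the label off the element which.max() selects.
most_common_word_name <- function(s){
  names(which.max(table(str_split(s, boundary("word"))[[1]])))
}

most_common_word_name("This is a very short sentence. It has only a few words: a, a. a")
# "a"
```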
Hope this helps:
most_common_word <- function(x){
  # Split the sentence into individual words for counting
  splitTest <- strsplit(x, " ")
  # Count each word
  count <- table(splitTest)
  # Sort so the highest count comes first, then keep only that entry
  count <- count[order(count, decreasing = TRUE)][1]
  # Return the word itself; by changing this you can choose whether it
  # also shows the number of times the word repeats
  return(names(count))
}
You can use return(count) to show the word together with the number of times it repeats. Be aware that this function has a problem when two words are repeated the same number of times: order(count, decreasing=TRUE) puts the highest counts first, but among tied counts the names are sorted alphabetically. So if 'a' and 'b' are repeated the same number of times, only 'a' gets displayed by most_common_word.
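If that alphabetical tie-breaking is not what you want, one possible variant (a sketch under the same space-splitting assumption; most_common_words is my name, not from the answer above) returns every word tied for the top count:

```r
# Return all words that share the highest count, not just the first one.
most_common_words <- function(x){
  words <- unlist(strsplit(x, " "))
  count <- table(words)
  names(count)[count == max(count)]
}

most_common_words("b b a a c")
# "a" "b"
```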
Here is a function I designed. Notice that I split the string on whitespace, removed any leading or trailing whitespace, removed ".", and converted all upper case to lower case. Finally, if there is a tie, I always report the first word. These are assumptions you should think about for your own analysis.
# Create example string
string <- "This is a very short sentence. It has only a few words."
library(stringr)
most_common_word <- function(string){
  string1 <- str_split(string, pattern = " ")[[1]]     # Split the string
  string2 <- str_trim(string1)                         # Remove whitespace
  string3 <- str_replace_all(string2, fixed("."), "")  # Remove dots
  string4 <- tolower(string3)                          # Convert to lower case
  word_count <- table(string4)                         # Count each word
  return(names(word_count[which.max(word_count)][1]))  # Report the most common word
}
most_common_word(string)
[1] "a"
Using the tidytext
package, taking advantage of established parsing functions:
library(tidytext)
library(dplyr)
word_count <- function(test_sentence) {
  unnest_tokens(data.frame(sentence = test_sentence,
                           stringsAsFactors = FALSE),
                word, sentence) %>%
    count(word, sort = TRUE)
}
word_count("This is a very short sentence. It has only a few words.")
This gives you a table with all the word counts. You can adapt the function to obtain just the top one, but be aware that there will sometimes be ties for first, so perhaps it should be flexible enough to extract multiple winners.
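For instance, one way to adapt it (a sketch assuming dplyr >= 1.0 for slice_max(); top_words is a hypothetical name) keeps every word tied for first place:

```r
library(tidytext)
library(dplyr)

# Tokenize, count, then keep all rows tied for the maximum count.
top_words <- function(test_sentence) {
  unnest_tokens(data.frame(sentence = test_sentence,
                           stringsAsFactors = FALSE),
                word, sentence) %>%
    count(word, sort = TRUE) %>%
    slice_max(n, with_ties = TRUE) %>%  # default is 1 row, plus any ties
    pull(word)
}

top_words("This is a very short sentence. It has only a few words.")
# "a"
```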