
I need to write a function that finds the most common word in a string of text, so that if I define a "word" as any sequence of letters, it can return the most common word(s).

Shivam

4 Answers


For general purposes, it is better to use boundary("word") in stringr:

library(stringr)
library(magrittr)  # provides the %>% pipe used below

most_common_word <- function(s){
    # Split on word boundaries and count the words; which.max() returns a
    # named index into the count table, and that name is the most common word
    which.max(table(s %>% str_split(boundary("word"))))
}
sentence <- "This is a very short sentence. It has only a few words: a, a. a"
most_common_word(sentence)
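The result of which.max() here is a named index into the count table, so if you only want the word itself you can wrap the call in names() (a small addition, not part of the original answer):

names(most_common_word(sentence))
# [1] "a"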
  • Excellent use of the `boundary` modifier. – www Oct 12 '17 at 14:21
  • Thanks for the great answer, it worked. Now, just a question: if I have a text file, I read it in as desc and then run words <- rep('', length(desc)); system.time( for(i in 1:length(desc)) { words[i] <- most_common_word(desc[i]) } ), but while calculating I am facing an error (see the sketch after these comments for one way to apply the function over a vector). I am also saving this code above in the question. – Shivam Oct 12 '17 at 14:45
  • @S. Olivia, please see EDIT 2 in the question. Any suggestions? – Shivam Oct 12 '17 at 14:48
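A minimal sketch (not from the original thread) of how the most-common-word lookup could be applied element-wise to a character vector such as desc, assuming one document per element; a guard is added because elements with no words (for example, only punctuation) leave the count table empty:

library(stringr)
library(magrittr)

most_common_word_vec <- function(s){
    counts <- table(s %>% str_split(boundary("word")))
    if (length(counts) == 0) return(NA_character_)  # element contained no words
    names(which.max(counts))                        # the most frequent word
}

desc <- c("a a b", "only one word each here", "...")  # stand-in for the real data
vapply(desc, most_common_word_vec, character(1), USE.NAMES = FALSE)
# returns "a", "each", NA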

Hope this helps:

most_common_word <- function(x){

    # Split the string into individual words for counting
    splitTest <- strsplit(x, " ")

    # Count how often each word appears
    count <- table(splitTest)

    # Sort so the highest count comes first, then keep only that entry
    count <- count[order(count, decreasing = TRUE)][1]

    # Return the word itself; returning count instead would also show
    # the number of times it repeats
    return(names(count))
}

You can use return(count) to show the word together with the number of times it repeats. This function has a problem when two words are repeated the same number of times, so beware of that.

The order function puts the highest count first (when used with decreasing=TRUE); ties are then decided by the names, which table sorts alphabetically. If the words 'a' and 'b' are repeated the same number of times, only 'a' gets returned by the most_common_word function.
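If you would rather see every word that ties for first place, a small variation (a sketch, not part of the original answer) keeps all names whose count equals the maximum:

most_common_words <- function(x){
    count <- table(strsplit(x, " ")[[1]])
    names(count[count == max(count)])  # every word tied for the highest count
}

most_common_words("b b a a c")
# returns "a" "b"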

Cris

Here is a function I designed. Notice that I split the string on whitespace, removed any leading or trailing whitespace, removed ".", and converted all upper case to lower case. Finally, if there is a tie, I always report the first word. These are assumptions you should think about for your own analysis.

# Create example string
string <- "This is a very short sentence. It has only a few words."

library(stringr)

most_common_word <- function(string){
  string1 <- str_split(string, pattern = " ")[[1]] # Split the string
  string2 <- str_trim(string1) # Remove white space
  string3 <- str_replace_all(string2, fixed("."), "") # Remove dot
  string4 <- tolower(string3) # Convert to lower case
  word_count <- table(string4) # Count the word number
  return(names(word_count[which.max(word_count)][1])) # Report the most common word
}

most_common_word(string)
[1] "a"
www

Using the tidytext package, taking advantage of established parsing functions:

library(tidytext)
library(dplyr)
word_count <- function(test_sentence) {
    unnest_tokens(data.frame(sentence = test_sentence,
                             stringsAsFactors = FALSE),
                  word, sentence) %>%
        count(word, sort = TRUE)
}

word_count("This is a very short sentence. It has only a few words.")

This gives you a table with all the word counts. You can adapt the function to obtain just the top one, but be aware that there will sometimes be ties for first, so perhaps it should be flexible enough to extract multiple winners.
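For instance, one way to keep only the winner or winners (a sketch building on the word_count function above, not part of the original answer) is to filter the counted table down to the rows that match the maximum count:

library(dplyr)

top_words <- function(test_sentence) {
    word_count(test_sentence) %>%
        filter(n == max(n))  # keeps every word tied for the top count
}

top_words("This is a very short sentence. It has only a few words.")
# a single row: the word "a" with n = 2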

David Klotz