I am currently enrolled in an R course, and one of the practice exercises is to build an R program that counts words in a string. We cannot use the function table, but must return the most popular word in a string using conventional means. E.g. The fox jumped over the cone and the... So the program would have to return "the", as it is the most popular word.

So far I have the following:

string_read<- function(phrase) {

  phrase <- strsplit(phrase, " ")[[1]]
  for (i in 1:length(phrase)){
    phrase.freq <- ....
    #if Word already exists then increase counter by 1

  }
}

I've hit a roadblock, however, as I'm not sure how to increase the counter for specific words. Can anyone point me in the right direction? My pseudocode would be something like: "For every word that is looped through, increase wordIndex by 1. If the word has already occurred before, increase the wordIndex counter."

Axeman
IronKirby
  • I'm aware similar variants have been asked but they tend to use table, library etc. or the like which the teaching advisor has ruled out. – IronKirby Jun 07 '17 at 06:15
  • Have you learned about the `list` datastructure in R? I think it would work well for storing the counts for each word. – Marius Jun 07 '17 at 06:16
  • We covered it very briefly - I'm happy to take a deeper look into lists however! We covered that alongside matrix structures I believe. – IronKirby Jun 07 '17 at 06:18
  • OK, if you remember that you can set and retrieve list values using strings, I think you'll be off to a good start, like `count_list[["fox"]] = 0; count_list[["fox"]] = count_list[["fox"]] + 1;` – Marius Jun 07 '17 at 06:22
  • I see! But the only problem with that is with a phrase that has X many elements, I can't create a list for every single permutation because it wouldn't be scalable then? Apologies if I have misunderstood. – IronKirby Jun 07 '17 at 06:35
  • Possible duplicate of https://stackoverflow.com/questions/8920145/count-the-number-of-words-in-a-string-in-r or https://stackoverflow.com/questions/7782113/counting-word-occurrences-in-r – akrun Jun 07 '17 at 06:38
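
The list-based idea suggested in the comments can be sketched as follows. This is a minimal sketch, assuming words are separated by single spaces and that case should be ignored; most_common_word is a hypothetical name used only for illustration. The list starts empty and gains an entry the first time each word appears, so nothing has to be created in advance for every possible word.

```r
# Hypothetical helper (name chosen for illustration): counts words with a
# named list instead of table().
most_common_word <- function(phrase) {
  words <- tolower(strsplit(phrase, " ")[[1]])
  words <- words[words != ""]            # drop empty tokens from repeated spaces
  counts <- list()                       # starts empty and grows as words appear
  for (w in words) {
    if (is.null(counts[[w]])) {
      counts[[w]] <- 1                   # first time this word is seen
    } else {
      counts[[w]] <- counts[[w]] + 1     # seen before: increase its counter
    }
  }
  # name of the first entry with the highest count
  names(counts)[which.max(unlist(counts))]
}

result <- most_common_word("The fox jumped over the cone and the")
result  # "the"
```

Looking up a name that is not yet in a list with `[[` returns NULL, which is what the is.null check relies on.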

2 Answers

3

You started off correctly by splitting the string into words. Next, we loop over each word using sapply and count how many times that word occurs in the vector. I have used tolower on the assumption that the comparison should not be case sensitive.

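string_read<- function(phrase) {
   # split on spaces and lower-case, so that "The" and "the" count as the same word
   temp = tolower(unlist(strsplit(phrase, " ")))
   # count each word's occurrences; return the first position with the highest count
   which.max(sapply(temp, function(x) sum(x == temp)))
}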

phrase <- "The fox jumped over the cone and the"

string_read(phrase)
#the 
#  1 

This returns the word together with its index position in the vector, which is 1 in this case. If you just want the word with the maximum count, you can change the last line to

temp[which.max(sapply(temp, function(x) sum(x == temp)))]
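
To see where the 1 comes from, it may help to look at the intermediate vector of counts. A small sketch, assuming the same example phrase; the variable names counts and best are introduced here for illustration only.

```r
phrase <- "The fox jumped over the cone and the"
temp <- tolower(unlist(strsplit(phrase, " ")))

# For each word, count how many elements of temp are equal to it.
counts <- sapply(temp, function(x) sum(x == temp))

# sapply names the result after the words, so a repeated word repeats its count.
unname(counts)             # 3 1 1 1 3 1 1 3

# which.max returns the position of the first maximum, named by its word.
best <- which.max(counts)
names(best)                # "the"
```

Because every occurrence of a word gets its own entry, the word is counted again at each position where it appears; that is fine for short phrases.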
Ronak Shah
  • Hi Ronak, thanks for that information! I'd like to try to understand what you have done here. You have converted the string to lower case and split it using " " as a delimiter. I just looked up unlist and I see that it converts a list to a vector. However, when you use strsplit, isn't the result already a vector? – IronKirby Jun 07 '17 at 06:37
  • @azurekirby basically, `unlist` here gives the same result as the `[[1]]` in `strsplit(phrase, " ")[[1]]`. It converts the `list` into a vector. – Ronak Shah Jun 07 '17 at 06:40
  • I see. But if you use is.vector(phrase) it would return TRUE already, so aren't you vectorising something that is already a vector when you use unlist? – IronKirby Jun 07 '17 at 06:52
  • @azurekirby okay. So `phrase` is of class "character" (check `class(phrase)`), and when we do `strsplit(phrase, " ")` the result is a list (check `class(strsplit(phrase, " "))`). When we `unlist` it, or do `[[1]]` as in your case, we convert it back into the same class as the original, i.e. "character" (check `class(temp)` or `class(strsplit(phrase, " ")[[1]])`). The only difference now is its length: check `length(phrase)` and `length(temp)`. Hope that clarifies it. – Ronak Shah Jun 07 '17 at 07:04
  • Thanks Ronak. I checked this with my instructor and it works, but when we tried reading the phrase from the console, i.e. `phrase <- readline()`, it didn't quite work. Are you able to provide any advice regarding that? – IronKirby Jun 07 '17 at 10:02
  • @azurekirby So do you have the strings stored in a text file, and are you giving its path via the console? – Ronak Shah Jun 07 '17 at 10:12
  • I was using the string as a test case. But now the supervisor has said to let the string be taken from the console so that multiple tests can be run against it. I changed the code to read `phrase <- readline(n=1) #specifying the number of lines I will take`, but it hasn't quite worked. – IronKirby Jun 07 '17 at 10:15
  • @azurekirby I am in a bit of a hurry right now, you could check [this](https://www.r-bloggers.com/passing-arguments-to-an-r-script-from-command-lines/) post for some reference. – Ronak Shah Jun 07 '17 at 10:19
  • No problems. Thanks for your help Ronak. – IronKirby Jun 07 '17 at 10:24
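
On the console question in the comments above: in base R, readline() accepts only a prompt argument, so a call such as readline(n=1) stops with an "unused argument" error; readLines() is the function that takes n. Below is a minimal sketch, assuming the phrase arrives as a single line. read_phrase is a hypothetical helper name, and an in-memory textConnection stands in for real console input so that the snippet runs anywhere.

```r
# Hypothetical helper: read exactly one line of text from a connection.
read_phrase <- function(con) {
  readLines(con, n = 1)
}

# In an interactive session, readline() takes only a prompt:
# phrase <- readline(prompt = "Enter a phrase: ")

# In a script run with Rscript, one option is to read from the process's
# standard input:
# phrase <- read_phrase(file("stdin"))

# Self-contained demonstration using an in-memory connection instead of stdin.
demo_con <- textConnection("The fox jumped over the cone and the")
phrase <- read_phrase(demo_con)
close(demo_con)

phrase  # "The fox jumped over the cone and the"
```

The resulting phrase can then be passed to string_read(phrase) as before. Which of the two commented-out variants fits depends on whether the code runs interactively or as a script.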
0

We can do this with str_extract_all from the stringr package, which extracts the words and leaves out punctuation

library(stringr)
string_read<- function(str1) {
  # extract runs of word characters, so punctuation such as commas is dropped
  temp <- tolower(unlist(str_extract_all(str1, "\\w+")))
  # count each word's occurrences; return the first position with the highest count
  which.max(sapply(temp, function(x) sum(x == temp)))
}

phrase <- "The fox jumped over the cone and the"
string_read(phrase)
#the 
#  1 
phrase2 <- "The fox jumped over the cone and the fox, fox, fox, fox, fox"
string_read(phrase2)
#fox 
#  2 
akrun
  • Hi Akrun, if you add another series of words to this, say, fox, fox, fox, fox, fox - I think it still registers 'the' as the most common word which is not correct. – IronKirby Jun 07 '17 at 11:52
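
One way to check a punctuated case like this is to compare the two tokenisations directly. The sketch below uses base R's regmatches and gregexpr as a stand-in for str_extract_all (an assumption made so that it runs without stringr); most_frequent is a hypothetical helper name.

```r
phrase2 <- "The fox jumped over the cone and the fox, fox, fox, fox, fox"

# Splitting on spaces keeps the commas attached, so "fox" and "fox," differ.
by_space <- tolower(strsplit(phrase2, " ")[[1]])

# Extracting runs of word characters drops the punctuation.
# Base R stand-in for str_extract_all(phrase2, "\\w+").
by_word <- tolower(regmatches(phrase2, gregexpr("\\w+", phrase2))[[1]])

# Hypothetical helper: the most frequent element of a character vector.
most_frequent <- function(words) {
  names(which.max(sapply(words, function(x) sum(x == words))))
}

sum(by_space == "fox")   # 2, because four of the tokens are "fox,"
sum(by_word == "fox")    # 6
most_frequent(by_space)  # "fox,"
most_frequent(by_word)   # "fox"
```

With space splitting the winner is the token "fox," including its comma, so stripping punctuation before counting matters for input like this.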