Counting Words in vector

Question

I am currently enrolled in an R course, and one of the practice exercises is to build an R program that counts words in a string. We cannot use the function table; the program must return the most frequent word in a string using basic means. For example, given "The fox jumped over the cone and the...", the program should return "the", as it is the most frequent word.

So far I have the following:

string_read <- function(phrase) {
  phrase <- strsplit(phrase, " ")[[1]]
  for (i in 1:length(phrase)) {
    phrase.freq <- ....
    # if word already exists then increase counter by 1
  }
}

I've hit a roadblock, however, as I'm not sure how to increase the counter for specific words. Can anyone point me in the right direction? My pseudocode would be something like: "For every word that is looped through, increase wordIndex by 1. If the word has already occurred before, increase the wordIndex counter."

I'm aware similar variants have been asked but they tend to use table, library etc. or the like which the teaching advisor has ruled out. — IronKirby, Jun 07 '17 at 06:15
Have you learned about the `list` datastructure in R? I think it would work well for storing the counts for each word. — Marius, Jun 07 '17 at 06:16
We covered it very briefly - I'm happy to take a deeper look into lists however! We covered that alongside matrix structures I believe. — IronKirby, Jun 07 '17 at 06:18
OK, if you remember that you can set and retrieve list values using strings, I think you'll be off to a good start, like `count_list[["fox"]] = 0; count_list[["fox"]] = count_list[["fox"]] + 1;` — Marius, Jun 07 '17 at 06:22
I see! But the only problem with that is with a phrase that has X many elements, I can't create a list for every single permutation because it wouldn't be scalable then? Apologies if I have misunderstood. — IronKirby, Jun 07 '17 at 06:35
Possible duplicate of https://stackoverflow.com/questions/8920145/count-the-number-of-words-in-a-string-in-r or https://stackoverflow.com/questions/7782113/counting-word-occurrences-in-r — akrun, Jun 07 '17 at 06:38
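
The list idea from the comments above can be sketched roughly as follows. This is only an illustration, not code posted by anyone in the thread: the function names `count_words` and `most_common` are hypothetical, and it assumes words are separated by single spaces and compared case-insensitively. The list starts empty and gains an entry the first time a word is seen, so nothing has to be created in advance for every possible word.

```r
# Hypothetical helper: build a named list of counts, one entry per distinct word.
count_words <- function(phrase) {
  words <- tolower(strsplit(phrase, " ")[[1]])
  words <- words[nzchar(words)]        # drop empty strings from repeated spaces
  counts <- list()                     # starts empty and grows as words appear
  for (w in words) {
    if (is.null(counts[[w]])) {
      counts[[w]] <- 1                 # first time this word is seen
    } else {
      counts[[w]] <- counts[[w]] + 1   # word seen before: increase its counter
    }
  }
  counts
}

# Hypothetical helper: walk the counts and keep the word with the largest one.
most_common <- function(phrase) {
  counts <- count_words(phrase)
  best <- names(counts)[1]
  for (w in names(counts)) {
    if (counts[[w]] > counts[[best]]) best <- w
  }
  best
}

most_common("The fox jumped over the cone and the")
```

If two words share the highest count, this sketch keeps whichever of them appeared first.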

score 3 · Accepted Answer · answered Jun 07 '17 at 06:23


You started off correctly by splitting the string into words. We then loop over each word with sapply and, for each one, count how many elements of the vector are equal to it. I have used tolower on the assumption that the comparison should not be case sensitive.

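string_read <- function(phrase) {
   # split on spaces and lower-case, so "The" and "the" count as the same word
   temp = tolower(unlist(strsplit(phrase, " ")))
   # count the occurrences of each word, then find the position of the first maximum
   which.max(sapply(temp, function(x) sum(x == temp)))
}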

phrase <- "The fox jumped over the cone and the"

string_read(phrase)
#the 
#  1

This returns the word together with its index position, which is 1 in this case. If you just want the word with the maximum count, you can change the last line to

temp[which.max(sapply(temp, function(x) sum(x == temp)))]
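
As a variation on the above, here is a minimal sketch that returns only the word and counts each distinct word once rather than once per occurrence. The name `string_read_word` is made up for this example, and the use of `unique` and `vapply` is my own choice, not part of the answer.

```r
# Hypothetical variant: return just the most frequent word.
string_read_word <- function(phrase) {
  temp <- tolower(unlist(strsplit(phrase, " ")))
  uniq <- unique(temp)                          # each distinct word once
  # named integer vector: one count per distinct word
  counts <- vapply(uniq, function(x) sum(x == temp), integer(1))
  # which.max gives the position of the first maximum
  names(counts)[which.max(counts)]
}

string_read_word("The fox jumped over the cone and the")
```

Because `which.max` returns the first maximum, a tie is resolved in favour of the word that appears earliest in the phrase.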

answered Jun 07 '17 at 06:23

Ronak Shah


Hi Ronak, thanks for that information! I'd like to try to understand what you have done here. You have converted the string to lower case and split it using " " as the delimiter. I just looked up unlist and I see that it converts a list to a vector. However, when you use strsplit, isn't the result already a vector? – IronKirby Jun 07 '17 at 06:37
@azurekirby basically, `unlist` does the same thing as `strsplit(phrase, " ")[[1]]` . It converts the `list` into a vector. – Ronak Shah Jun 07 '17 at 06:40

I see. But if you use the is.vector(phrase) it would return TRUE already so aren't you vectorising a vector that is already a vector when you use unlist? – IronKirby Jun 07 '17 at 06:52
@azurekirby okay. So `phrase` is of class "character", check `class(phrase)`, and when we do `strsplit(phrase, " ")` it is a list, check `class(strsplit(phrase, " "))`. So when we `unlist` or do `[[1]]` as in your case, we convert it back into the same class as the original, i.e. "character", check `class(temp)` or `class(strsplit(phrase, " ")[[1]])`. The only difference now is its length, check `length(phrase)` and `length(temp)`. Hope it clarifies. – Ronak Shah Jun 07 '17 at 07:04
Thanks Ronak. I checked this with my instructor and it works, but it didn't quite work when we tried reading the input from the console, i.e. `phrase <- readline()`. Are you able to provide any advice regarding that? – IronKirby Jun 07 '17 at 10:02
@azurekirby So do you have the strings stored in a text file ? and you are giving the path of it via console? – Ronak Shah Jun 07 '17 at 10:12
I was using the string as a test case. But now the supervisor has said to let the string be taken from the console so that multiple tests can be run against it. I changed the code to read `phrase <- readline(n=1) #specifying the number of lines I will take`, but it hasn't quite worked when I have done that. – IronKirby Jun 07 '17 at 10:15
@azurekirby I am in a bit of hurry right now, you could check [this](https://www.r-bloggers.com/passing-arguments-to-an-r-script-from-command-lines/) post for some reference. – Ronak Shah Jun 07 '17 at 10:19
No problems. Thanks for your help Ronak. – IronKirby Jun 07 '17 at 10:24
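
On the console question in the comments above: base R's `readline()` accepts only a `prompt` argument, so `readline(n=1)` fails with an unused-argument error. It is `readLines()` that has an `n` argument. In a non-interactive run (for example under Rscript), `readline()` also returns an empty string without waiting for input. The following is a rough sketch under those assumptions. The helper names `read_phrase` and `most_common_word` are hypothetical, and a text connection stands in for the console so the example runs anywhere.

```r
# Hypothetical helper: read a single line from any connection.
read_phrase <- function(con) {
  readLines(con, n = 1)
}

# Same counting idea as in the answer, returning just the word.
most_common_word <- function(phrase) {
  words <- tolower(strsplit(phrase, " ")[[1]])
  counts <- sapply(words, function(w) sum(w == words))
  words[[which.max(counts)]]
}

# In an interactive session one could use:
#   phrase <- readline("Enter a phrase: ")
# In a script run with Rscript one could use:
#   con <- file("stdin"); phrase <- read_phrase(con); close(con)

# Stand-in for console input so this sketch is self-contained:
con <- textConnection("The fox jumped over the cone and the")
phrase <- read_phrase(con)
close(con)

result <- most_common_word(phrase)
print(result)
```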

akrun · Answer 2 · 2017-06-07T12:21:38.877

0

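We can do this with str_extract_all from the stringr package, which extracts the words themselves and so ignores punctuation such as commas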

library(stringr)
string_read<- function(str1) {
  temp <- tolower(unlist(str_extract_all(str1, "\\w+")))
  which.max(sapply(temp, function(x) sum(x == temp)))
}

phrase <- "The fox jumped over the cone and the"
string_read(phrase)
#the 
#  1 
phrase2 <- "The fox jumped over the cone and the fox, fox, fox, fox, fox"
string_read(phrase2)
#fox 
# 2
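
Since the question rules out loading packages, here is a hedged base R sketch of the same word extraction using `regmatches` and `gregexpr` in place of stringr. The function name `string_read_base` is invented for this example, and it returns the word itself rather than the word with its position.

```r
# Hypothetical base R variant of the stringr approach above.
string_read_base <- function(str1) {
  # extract runs of word characters, so "fox," and "fox" both become "fox"
  words <- tolower(regmatches(str1, gregexpr("\\w+", str1))[[1]])
  counts <- sapply(words, function(x) sum(x == words))
  words[which.max(counts)]
}

phrase2 <- "The fox jumped over the cone and the fox, fox, fox, fox, fox"
string_read_base(phrase2)
```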

edited Jun 07 '17 at 12:21

answered Jun 07 '17 at 06:45

akrun


Hi Akrun, if you add another series of words to this, say, fox, fox, fox, fox, fox - I think it still registers 'the' as the most common word which is not correct. – IronKirby Jun 07 '17 at 11:52
