Finding number of occurrences of a word in a file using R functions

Question

I am using the following code to find the number of occurrences of the word "memory" in a file, and I am getting the wrong result. Can you please help me figure out what I am missing?

NOTE1: The question is looking for exact occurrences of the word "memory". NOTE2: I have realized that they are looking for exactly "memory"; even something like "memory," is not accepted. I guess that is what caused the confusion. I tried it with the word "action" and the correct answer is 7. You can try it as well.

```r
# names <- scan("hamlet.txt", what=character())
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())
# Read 28230 items
length(grep("memory", names))
# [1] 9
```

Here's the file

What result are you expecting? `grep` will return the number of elements (lines) that contain the string "memory". If there are multiple instances per element, `grep` won't tell you. Would that explain any perceived discrepancies? — jbaums, Feb 05 '14 at 02:57
@Fernando I don't know what the right answer is. I just know that 9 is not correct. I don't know what are the alternatives to this solution. — Mona Jalal, Feb 05 '14 at 02:59
@jbaums so if grep doesn't help in that case, what should be used? — Mona Jalal, Feb 05 '14 at 02:59
Please make this a self-contained example with a small data file. Don't link to pastebin.com, put the data here. — Matthew Lundberg, Feb 05 '14 at 03:00
Have a look at `grep('memory', names, value=TRUE, ignore.case=TRUE)`. The 4th match has 2 instances of 'memory', so I guess it is mentioned 10 times in the text. — jbaums, Feb 05 '14 at 03:02
What's wrong with this syntax? `length(which(names["memory"]))` ? — Mona Jalal, Feb 05 '14 at 03:04
Furthermore, open http://pastebin.com/raw.php?i=kC9aRvfB, hit ctrl+f and search for 'memory'. You will see there are indeed 10 instances. — jbaums, Feb 05 '14 at 03:04
I don't get why there's a negative 1 to this question. It's huge data and doesn't make sense to put it here! — Mona Jalal, Feb 05 '14 at 03:05
@MonaJalal: I suspect @MatthewLundberg expects you to include a small example dataset. Alternatively, if the file you refer to will be permanently hosted there, edit your question and include `names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character())` — jbaums, Feb 05 '14 at 03:06
@jbaums You are correct. I expect a small example dataset, not a pastebin.com entry that will disappear tomorrow. I downvoted this question, and if there isn't an example dataset posted, I will also vote to close. — Matthew Lundberg, Feb 05 '14 at 03:08
I was already aware of that with Pastebin and set the paste to never expire! It will always be there! — Mona Jalal, Feb 05 '14 at 03:10
Well, your refusal to put data in the question is my impetus to put it in the queue. Good luck! @MonaJalal You setting it to not expire isn't enough. *You* could delete it! — Matthew Lundberg, Feb 05 '14 at 03:11
I agree that efforts should be made to include data, but it's not always possible (e.g. if the OP needs help identifying peculiarities with the exact dataset). In fact, the community wiki on "How to make a great R reproducible example" [alludes to this](http://stackoverflow.com/a/6699112/489704). That said, perhaps the question could be rephrased to make it more general, and a subset of the data could be included. — jbaums, Feb 05 '14 at 03:17
@jbaums Reducing the problem to a small example is the first step in posting a question, and in "real life" it is often the first step in finding a solution, because doing so frequently leads to one. If it does lead to a solution, there is no rule against a person posting both the question and the answer. — Matthew Lundberg, Feb 05 '14 at 03:21

andypea · Answer 1 · 2016-10-31T22:48:02.780

The problem is really Shakespeare's use of punctuation. There are a lot of apostrophes (') in the text. When the R function `scan` encounters an apostrophe, it assumes it is the start of a quoted string and reads all characters up to the next apostrophe into a single element of your `names` vector. One of these long elements happens to include two instances of the word "memory", which reduces the total number of matches by one.

You can fix the problem by telling `scan` to treat all quotation marks as normal characters rather than as string delimiters:

```r
names <- scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL)
```
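To see the effect without downloading the file, here is a minimal sketch on a made-up sentence. The string `txt` is a hypothetical stand-in for the Hamlet text, and `quote = ""` is used as an empty quote set, which should turn off quote handling in the same way as `quote=NULL` above.

```r
# Hypothetical sample text (not taken from the Hamlet file).
txt <- "he said 'memory of a memory' fades"

# Default behaviour: the span between the apostrophes is read as ONE element.
default_tokens <- scan(text = txt, what = character(), quiet = TRUE)

# Quote handling disabled: every whitespace-separated token is its own element.
plain_tokens <- scan(text = txt, what = character(), quote = "", quiet = TRUE)

length(default_tokens)                  # 4
length(plain_tokens)                    # 7

length(grep("memory", default_tokens))  # 1: both words sit in one element
length(grep("memory", plain_tokens))    # 2
```

With the default quoting, the two occurrences collapse into a single matching element, which is the same kind of undercount seen in the question.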

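Be careful when using the R implementation of `grep`. It does not behave in exactly the same way as the usual GNU/Linux program: it returns the indices of the elements of a vector that match the pattern, not the matching lines of a file. Because `scan` splits the text into whitespace-separated words, the way you have used it here WILL find the number of matching words, and not just the total number of matching lines as some people have suggested.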
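As a rough illustration of the difference between matching elements and matching occurrences, the sketch below uses a small made-up vector (`elements` is hypothetical, not data from the file). It counts elements with `grep` and individual occurrences with `gregexpr` plus `regmatches`; `lengths` assumes R 3.2.0 or later.

```r
# Hypothetical vector in which one element contains the word twice.
elements <- c("memory and memory", "no match here", "a memory")

# grep() reports which ELEMENTS match, however many hits each one holds.
n_elements <- length(grep("memory", elements))

# gregexpr() locates every occurrence inside each element;
# regmatches() extracts them, and lengths() counts them per element.
hits <- regmatches(elements, gregexpr("memory", elements, fixed = TRUE))
n_occurrences <- sum(lengths(hits))

n_elements     # 2
n_occurrences  # 3
```

When every element is a single word, as after `scan(..., quote=NULL)`, the two counts agree unless a single token contains the pattern more than once.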

Thanks for pointing that out, @andrew. I should have stated 'elements' rather than 'lines', although in @Fernando's example (using `readLines`), elements are lines. — jbaums, Feb 05 '14 at 03:26
@Fernando: in this particular case, there was at most 1 instance per element, but there could have been more. Using `scan` with args as suggested by @andrew splits the text to individual words. — jbaums, Feb 05 '14 at 03:28

Fernando · Accepted Answer · 2014-02-05T13:45:29.897

As pointed out by @andrew, my previous answer would give wrong results if a word repeats on the same line. Based on the other answers and comments, this one seems OK:

```r
names = scan('http://pastebin.com/raw.php?i=kC9aRvfB', what=character(), quote=NULL)
idxs = grep("memory", names, ignore.case = TRUE)

length(idxs)
# [1] 10
```
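Note that this pattern matches "memory" as a substring and ignores case, so tokens such as "Memory," or a longer word containing "memory" are counted too. The sketch below shows the difference on a made-up token vector (`tokens` is hypothetical); the word-boundary pattern with `perl = TRUE` is one possible way to restrict the match to whole words, not something the answer itself uses.

```r
# Hypothetical tokens, as scan() might produce them.
tokens <- c("memory", "Memory,", "memory's", "memoryless", "remember")

# Substring match, case-insensitive: also counts "memoryless".
n_substring <- length(grep("memory", tokens, ignore.case = TRUE))

# Whole-word match: \b requires a word boundary on both sides.
n_whole_word <- length(grep("\\bmemory\\b", tokens,
                            ignore.case = TRUE, perl = TRUE))

n_substring   # 4
n_whole_word  # 3
```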

edited Feb 05 '14 at 13:45

answered Feb 05 '14 at 03:18


It's more likely two occurrences of the word in one line. – Matthew Lundberg Feb 05 '14 at 03:19
This is incorrect. Try `grep('memory', 'memory\n')` and see that `grep` isn't concerned about the '\n'. – jbaums Feb 05 '14 at 03:19
@jbaums `length(grep("memory\n",names))` returns 1 here... I don't get it. – Fernando Feb 05 '14 at 03:21
Even after your edit, your code will not detect multiple instances of 'memory' in a single element (line). It tells us only the number of lines that include one or more instances of 'memory'. – jbaums Feb 05 '14 at 03:22
@Fernando: Think about it this way... `grep('memory', 'memory\n')` is saying "find 'memory' in the string 'memory\n'", whereas `grep('memory\n', 'memory')` is saying "find 'memory\n' in the string 'memory'". – jbaums Feb 05 '14 at 03:24
Does it match only the exact word "memory"? – Mona Jalal Feb 05 '14 at 03:27
No, for that you need to search for the pattern `^memory$`. Case will also be matched unless you add the argument `ignore.case=TRUE`. If you want to allow punctuation, e.g. commas after memory, then try `grep('^memory$', gsub('\\W', '', names))`. – jbaums Feb 05 '14 at 03:28
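The variants from the last comment can be compared side by side. This is a small sketch on a hypothetical token vector (`words` is made up for illustration), showing exact matching, punctuation-tolerant matching, and the case-insensitive version.

```r
# Hypothetical tokens with punctuation and case variations.
words <- c("memory", "memory,", "Memory", "memory.", "memoryless")

# Exact match only: anchors force the whole element to equal "memory".
n_exact <- length(grep("^memory$", words))

# Strip non-word characters first so "memory," and "memory." also qualify.
cleaned <- gsub("\\W", "", words)
n_clean <- length(grep("^memory$", cleaned))

# Additionally ignore case so "Memory" is counted.
n_clean_nocase <- length(grep("^memory$", cleaned, ignore.case = TRUE))

n_exact         # 1
n_clean         # 3
n_clean_nocase  # 4
```

Which count is the "right" one depends on how strictly the exercise defines a word; the strict reading in the question's NOTE2 corresponds to `n_exact`.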
