Count word occurrences in R

Question

Is there a function for counting the number of times a particular keyword is contained in a dataset?

For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.

IRTFM · Accepted Answer · 2011-10-17T18:58:54.267

46

Let's assume for the moment that you want the number of elements containing "corn":

length(grep("corn", dataset))
[1] 3

After you get the basics of R down better you may want to look at the "tm" package.

EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:

grep("\\<corn\\>", dataset)
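The difference between the two patterns above can be checked on the question's example data. This is a minimal base R sketch; the variable names `n_any` and `n_word` are illustrative only.

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# Elements containing "corn" anywhere, including inside "cornmeal"
n_any <- sum(grepl("corn", dataset))

# Elements containing "corn" as a whole word (\\< and \\> mark word edges)
n_word <- length(grep("\\<corn\\>", dataset))

print(c(any = n_any, whole_word = n_word))
```

Here the substring pattern matches three elements and the whole-word pattern matches two, because "cornmeal" no longer qualifies.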

edited Oct 17 '11 at 18:58

answered Oct 16 '11 at 03:41

IRTFM

You could split the vectors on " ", do unique and run table on the whole thing. :) – Roman Luštrik Oct 16 '11 at 09:16
3

Right. Which highlights the ambiguity of the original question. I could not figure out why 4 was the right number. Your method would return 2 for "corn", 1 for "meal", and 1 for "cornmeal". The greppish way to count space-delimited words "corn" might be: length(grep("^corn$|^corn | corn$", dataset)) – IRTFM Oct 16 '11 at 15:18
That was a typo, sorry. The count would be 3. – LNA Oct 16 '11 at 15:57
That is, I upvoted and accepted your answer. Thanks again! Though I don't understand what Roman means by splitting the vectors on "" and doing "unique." – LNA Oct 16 '11 at 18:05
1

I think he typed it with a space between the quotes, which would have given you a list of "whole words". The proportional font in which comments are displayed probably made you think there was no space. – IRTFM Oct 16 '11 at 21:03
@42- do you know by chance whether `grep("corn", dataset)` is faster, equivalent, or slower than `which("corn"==dataset)`? – Antoine Jul 12 '16 at 10:32
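Roman Luštrik's split / unique / table suggestion from the comments above could look roughly like this. It is a sketch of one reading of that comment (splitting on a single space and de-duplicating words within each element), not necessarily the exact code he had in mind.

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# Split each element on a single space, keeping each word once per element
words_per_element <- lapply(strsplit(dataset, " ", fixed = TRUE), unique)

# Tabulate: each count is the number of elements containing that whole word
word_counts <- table(unlist(words_per_element))
print(word_counts)

n_corn <- as.integer(word_counts["corn"])
```

As IRTFM notes, this gives 2 for "corn", 1 for "meal" and 1 for "cornmeal", since "cornmeal" is treated as its own word.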

petermeissner · Answer 2 · 2017-03-23T05:14:29.993

34

Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:

library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0

# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0

# summing it up
sum(str_count(dataset, "corn"))
# [1] 3

edited Mar 23 '17 at 05:14

answered Mar 12 '13 at 08:43

petermeissner

This method is better if you want to count occurrences inside the vector members. In this example: `dataset <- c("corn corn", "cornmeal", "corn on the cob", "meal")`, str_count will return `# [1] 2 1 1 0`, while grep counts each matching element only once, however many times the pattern occurs in it. – HaReL Oct 03 '18 at 20:35
Hi, what if I want to do this but with a dictionary? so say I have dictionary A and I would like to know how many times words from dictionary A occur in the dataset, with both (1) repetition of the same word and (2) iterations of different words included in the summation? I tried sum(str_count()) with a dictionary, but it only includes the count of how many times a unique word from the dictionary occurs rather than including repetition of words. – hongpastry Feb 23 '21 at 17:40
Sounds like a new question, rather than a comment. I suggest you open a new question and add some code examples (example data, things tried so far, results you would like to get) to sketch out your problem so it can be understood easily. – petermeissner Feb 25 '21 at 08:37
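For readers without stringr, the per-element counting that HaReL describes can also be approximated in base R. This is a sketch using `gregexpr` and `regmatches` with a fixed (non-regex) pattern; the variable names are illustrative.

```r
dataset <- c("corn corn", "cornmeal", "corn on the cob", "meal")

# All matches of the literal string "corn" within each element
matches <- regmatches(dataset, gregexpr("corn", dataset, fixed = TRUE))

# Occurrences per element (elements with no match get 0)
per_element <- lengths(matches)
print(per_element)

# Number of elements with at least one match vs. total occurrences
n_elements <- sum(grepl("corn", dataset, fixed = TRUE))
n_total <- sum(per_element)
```

With this input, three elements contain "corn" but the string occurs four times in total, which is the distinction between the grep-based and str_count-based answers.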

score 2 · Answer 3 · answered Dec 02 '17 at 12:48

2

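You can also do something like the following. Note that it counts only elements that are exactly equal to "corn", so it gives 1 for the example dataset: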

length(dataset[which(dataset=="corn")])
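An exact comparison with `==` and a substring search with `grep` answer different questions, which is easy to see side by side. A small sketch on the question's data (variable names are illustrative):

```r
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# Positions of elements that are exactly "corn"
exact_idx <- which(dataset == "corn")

# Positions of elements that contain "corn" anywhere
substr_idx <- grep("corn", dataset, fixed = TRUE)

# A shorter way to count exact matches
n_exact <- sum(dataset == "corn")

print(list(exact = exact_idx, substring = substr_idx, n_exact = n_exact))
```

So the two calls are not interchangeable: the first finds one element, the second finds three.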

answered Dec 02 '17 at 12:48

Junaid

score 1 · Answer 4 · answered Oct 15 '18 at 21:11

1

I'd just do it with string division like:

library(roperators)

dataset <- c("corn", "cornmeal", "corn on the cob", "meal")

# for each vector element:
dataset %s/% 'corn'

# for everything:
sum(dataset %s/% 'corn')

answered Oct 15 '18 at 21:11

Benbob

score 1 · Answer 5 · answered Apr 21 '21 at 12:02

You can use the str_count function from the stringr package to count how many times a keyword occurs in each element of a character vector.

The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.

The regular expression syntax is very flexible and allows matching whole words as well as character patterns.

For example, the following code will count all occurrences of the string "corn" and will return 3:

sum(str_count(dataset, regex("corn")))

To match complete words use:

sum(str_count(dataset, regex("\\bcorn\\b")))

The "\b" specifies a word boundary. By default, str_count treats an apostrophe as a word boundary, so if your dataset contains the string "corn's", the "corn" in it is matched and included in the result.

To prevent words containing an apostrophe from being counted, use the regex function with the parameter uword = TRUE. This makes the regular expression engine use the Unicode TR 29 definition of word boundaries (see http://unicode.org/reports/tr29/tr29-4.html), which does not treat an apostrophe inside a word as a boundary.

The following code will give the number of times the word "corn" occurs. Words such as "corn's" will not be included.

sum(str_count(dataset, regex("\\bcorn\\b", uword = TRUE)))
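To see the effect of uword, the two boundary definitions can be compared on data that actually contains a possessive. This is a sketch that assumes stringr is installed; `dataset2` is a hypothetical extension of the question's example, not part of the original data.

```r
library(stringr)

# Hypothetical data: the original example plus a possessive form
dataset2 <- c("corn", "cornmeal", "corn on the cob", "meal", "corn's")

# Default word boundaries: the apostrophe ends the word, so "corn's" counts
default_count <- sum(str_count(dataset2, regex("\\bcorn\\b")))

# Unicode TR 29 word boundaries: "corn's" is one word, so it is skipped
uword_count <- sum(str_count(dataset2, regex("\\bcorn\\b", uword = TRUE)))

print(c(default = default_count, uword = uword_count))
```

On the original four-element dataset both calls give the same result, since it contains no apostrophes.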
