Extracting specified word from a vector using R

Question

I have a text, e.g.

text <- "i am happy today :):)"

I want to extract :) from the text vector and report its frequency.

score 5 · Accepted Answer · answered Apr 11 '12 at 07:44

Here's one idea, which would be easy to generalize:

text<- c("i was happy yesterday :):)",
         "i am happy today :)",
         "will i be happy tomorrow?")

(nchar(text) - nchar(gsub(":)", "", text))) / 2
# [1] 2 1 0
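
One way the idea above might be generalized is sketched below. The helper name `count_literal` is hypothetical (it is not part of the original answer), and `fixed = TRUE` is an added assumption so that the pattern is treated as a literal string rather than a regular expression. Dividing by `nchar(pattern)` replaces the hard-coded 2.

```r
# Hypothetical helper: count non-overlapping occurrences of a literal
# string in each element of a character vector.
count_literal <- function(x, pattern) {
  stopifnot(nchar(pattern) > 0)
  # Remove every occurrence of the pattern, treated literally
  removed <- gsub(pattern, "", x, fixed = TRUE)
  # Characters removed, divided by the pattern length, is the count
  (nchar(x) - nchar(removed)) / nchar(pattern)
}

text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "will i be happy tomorrow?")

count_literal(text, ":)")
# [1] 2 1 0
```

Because the pattern is matched literally, the same helper should work for strings containing other regex metacharacters, such as `":("` or `"^_^"`.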

answered Apr 11 '12 at 07:44

Josh O'Brien


You could also use the opposite for just one `nchar()` call: `nchar(gsub("[^:)]", "", text)) / 2` – Sacha Epskamp Apr 11 '12 at 09:03
@SachaEpskamp -- Unfortunately, that doesn't do quite the same thing, since it replaces everything except the characters `:` and `)`, when you were really wanting to replace everything except the *string* `:)`. Try your idea with `text <- "A:"` to see what I mean. – Josh O'Brien Apr 11 '12 at 16:32

Right thanks. I am fairly new to regular expressions. I think this works? `nchar(gsub("^((?!:\\)).)*", "", text, perl=TRUE)) / 2` – Sacha Epskamp Apr 11 '12 at 17:57
Looks good to me, although I'm having some trouble getting my head around a few of the details, like why the `"^"` is needed, and why all `)` in `text` don't get matched and removed. Nice work though. – Josh O'Brien Apr 11 '12 at 18:26

score 3 · Answer 2 · edited Apr 11 '12 at 22:42

I assume you only want the count, or do you also want to remove :) from the string?

For the count you can do:

length(gregexpr(":)",text)[[1]])

which gives 2. A more generalized solution for a vector of strings is:

sapply(gregexpr(":)",text),length)

Edit:

Josh O'Brien pointed out that this also returns 1 if there is no :), since gregexpr returns -1 in that case. To fix this you can use:

sapply(gregexpr(":)",text),function(x)sum(x>0))

Which does become slightly less pretty.
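
A possible variation on the fix above, as a sketch rather than the answer's own code: `fixed = TRUE` is an added assumption that makes `":)"` a literal string, and `regmatches()` with `lengths()` is a swapped-in alternative that both extracts the matches and counts them. The example vector is made up for illustration and includes a string with no smiley.

```r
text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "ABC")

# One element per string; each holds match start positions, or -1 if none
matches <- gregexpr(":)", text, fixed = TRUE)

# Count only real match positions, so the -1 placeholder contributes 0
counts <- vapply(matches, function(m) sum(m > 0), integer(1))
counts
# [1] 2 1 0

# regmatches() extracts the matched substrings and returns character(0)
# for a string with no match, so lengths() gives the same counts
lengths(regmatches(text, matches))
# [1] 2 1 0
```

`regmatches(text, matches)` on its own returns the extracted `":)"` strings, which covers the "extract" part of the question as well as the frequency.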

edited Apr 11 '12 at 22:42

Josh O'Brien


answered Apr 11 '12 at 07:50

Sacha Epskamp


This is a great idea, but it needs a bit more work, as it fails for strings that don't contain any `":)"` strings at all. (Try out your functions with `text <- "ABC"`, for instance, to see that they both 'claim' that it contains 1 smiley face.) That's because `gregexpr()` returns `-1` for such a string, which has a length of 1. I do think that a fixed version of your approach would be a cleaner solution than the one I proposed... – Josh O'Brien Apr 11 '12 at 16:45
Cool. I'll file this away as a good example of the kind of problem where `gregexpr()` excels. – Josh O'Brien Apr 11 '12 at 18:21

BenBarnes · Answer 3 · 2012-04-11T07:56:34.650

This does the trick but might not be the most direct way:

mytext<- "i am happy today :):)"

# The following line inserts semicolons to split on
myTextSub<-gsub(":)", ";:);", mytext)

# Then split and unlist
myTextSplit <- unlist(strsplit(myTextSub, ";"))

# Then see how many times the smiley turns up
length(grep(":)", myTextSplit))
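
To see why the steps above give the right count, it may help to look at the intermediate values. The sketch below repeats the same steps with `fixed = TRUE` added (an assumption, so that `":)"` and `";"` are matched literally); the variable names `marked` and `pieces` are hypothetical.

```r
mytext <- "i am happy today :):)"

# Wrap every smiley in semicolons
marked <- gsub(":)", ";:);", mytext, fixed = TRUE)
marked
# [1] "i am happy today ;:);;:);"

# Splitting on ";" isolates each smiley as its own element
pieces <- unlist(strsplit(marked, ";", fixed = TRUE))
pieces
# [1] "i am happy today " ":)"                ""                  ":)"

# Count the elements that contain a smiley
length(grep(":)", pieces, fixed = TRUE))
# [1] 2
```

The empty string between the two smileys comes from the adjacent `;;`; it does not affect the count because `grep()` only reports the elements that contain `":)"`.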

EDIT

To handle vectors of text with length > 1, don't unlist:

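mytext <- rep("i am happy today :):)", 2)

# fixed = TRUE treats ":)" and ";" as literal strings, so no escaping is needed
myTextSub <- gsub(":)", ";:);", mytext, fixed = TRUE)
myTextSplit <- strsplit(myTextSub, ";", fixed = TRUE)

# One count per element of mytext
sapply(myTextSplit, function(x) {
  length(grep(":)", x, fixed = TRUE))
})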

But I like the other answers better.
