3

I have a text, e.g.:

text <- "i am happy today :):)"

I want to extract `:)` from the text vector and report its frequency.

jan5

3 Answers

5

Here's one idea, which would be easy to generalize:

text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "will i be happy tomorrow?")

(nchar(text) - nchar(gsub(":)", "", text))) / 2
# [1] 2 1 0
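
As a rough sketch of how this could be generalized, the same length-difference trick can be wrapped in a small helper. The name `count_fixed` is hypothetical and not part of the answer; it assumes the pattern should be matched as literal text, which is why it passes `fixed = TRUE` to `gsub()`.

```r
# Hypothetical helper (not from the answer): count non-overlapping
# occurrences of a literal string in each element of a character vector.
count_fixed <- function(x, pattern) {
  stopifnot(nchar(pattern) > 0)
  # fixed = TRUE treats `pattern` as plain text rather than a regex
  removed <- gsub(pattern, "", x, fixed = TRUE)
  # Each removed occurrence shortens the string by nchar(pattern)
  (nchar(x) - nchar(removed)) / nchar(pattern)
}

text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "will i be happy tomorrow?")

count_fixed(text, ":)")
# [1] 2 1 0
count_fixed(text, "happy")
# [1] 1 1 1
```

Dividing by `nchar(pattern)` instead of a hard-coded 2 is what lets the helper work for patterns of any length.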
Josh O'Brien
  • You could also use the opposite for just one `nchar()` call: `nchar(gsub("[^:)]", "", text)) / 2` – Sacha Epskamp Apr 11 '12 at 09:03
  • @SachaEpskamp -- Unfortunately, that doesn't do quite the same thing, since it replaces everything except the characters `:` and `)`, when you really wanted to replace everything except the *string* `:)`. Try your idea with `text <- "A:"` to see what I mean. – Josh O'Brien Apr 11 '12 at 16:32
  • Right, thanks. I am fairly new to regular expressions. I think this works? `nchar(gsub("^((?!:\\)).)*", "", text, perl=TRUE)) / 2` – Sacha Epskamp Apr 11 '12 at 17:57
  • Looks good to me, although I'm having some trouble getting my head around a few of the details, like why the `"^"` is needed, and why all `)` in `text` don't get matched and removed. Nice work though. – Josh O'Brien Apr 11 '12 at 18:26
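
To make the point about `text <- "A:"` in the comments above concrete, here is a small comparison. The variable names `class_count` and `fixed_count` are made up for this illustration, and the second version swaps in `fixed = TRUE` as one possible way to match the literal string.

```r
text <- c("i was happy yesterday :):)", "A:")

# Character-class version: keeps every ':' and ')' individually,
# so a lone ':' ends up counted as half a smiley.
class_count <- nchar(gsub("[^:)]", "", text)) / 2
class_count
# [1] 2.0 0.5

# Fixed-string version: only the two-character sequence ":)" is removed.
fixed_count <- (nchar(text) - nchar(gsub(":)", "", text, fixed = TRUE))) / 2
fixed_count
# [1] 2 0
```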
3

I assume you only want the count, or do you also want to remove :) from the string?

For the count you can do:

length(gregexpr(":)",text)[[1]])

which gives 2. A more generalized solution for a vector of strings is:

sapply(gregexpr(":)",text),length)

Edit:

Josh O'Brien pointed out that this also returns 1 if there is no `:)`, since `gregexpr()` returns -1 in that case (a vector of length 1). To fix this you can use:

sapply(gregexpr(":)",text),function(x)sum(x>0))

Which does become slightly less pretty.
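
One possible alternative, sketched here under the assumption that R 3.2.0 or later is available (for `lengths()`), is to pass the match data through `regmatches()`, which yields an empty character vector for elements with no match. The variable names `m` and `smiley_counts` are illustrative only.

```r
text <- c("i was happy yesterday :):)",
          "i am happy today :)",
          "ABC")

# gregexpr() returns -1 for an element with no match; regmatches() turns
# that into an empty character vector, so lengths() reports 0 for it.
m <- gregexpr(":)", text, fixed = TRUE)
smiley_counts <- lengths(regmatches(text, m))
smiley_counts
# [1] 2 1 0
```

This avoids the explicit `x > 0` check, at the cost of building the vectors of matched strings first.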

Josh O'Brien
Sacha Epskamp
  • This is a great idea, but it needs a bit more work, as it fails for strings that don't contain any `":)"` strings at all. (Try out your functions with `text <- "ABC"`, for instance, to see that they both 'claim' that it contains 1 smiley face.) That's because `gregexpr()` returns `-1` for such a string, which has a length of 1. I do think that a fixed version of your approach would be a cleaner solution than the one I proposed... – Josh O'Brien Apr 11 '12 at 16:45
  • Cool. I'll file this away as a good example of the kind of problem where `gregexpr()` excels. – Josh O'Brien Apr 11 '12 at 18:21
1

This does the trick but might not be the most direct way:

mytext<- "i am happy today :):)"

# The following line inserts semicolons to split on
myTextSub<-gsub(":)", ";:);", mytext)

# Then split and unlist
myTextSplit <- unlist(strsplit(myTextSub, ";"))

# Then see how many times the smiley turns up
length(grep(":)", myTextSplit))

EDIT

To handle vectors of text with length > 1, don't unlist:

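mytext <- rep("i am happy today :):)", 2)
# fixed = TRUE matches ":)" literally, so no regex escaping is needed
myTextSub <- gsub(":)", ";:);", mytext, fixed = TRUE)
myTextSplit <- strsplit(myTextSub, ";")

# Count the pieces containing a smiley in each element
sapply(myTextSplit, function(x) {
  length(grep(":)", x, fixed = TRUE))
})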

But I like the other answers better.

BenBarnes