i guess this is a common problem, and i found quite a lot of webpages, including some from SO, but i failed to understand how to implement it.
I am new to REGEX, and I'd like to use it in R to extract the first few words from a sentence.
for example, if my sentence is
z = "I love stack overflow it is such a cool site"
id like to have my output as being (if i need the first four words)
[1] "I love stack overflow"
or (if i need the last four words)
[1] "such a cool site"
of course, the following works
paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")
but i'd like to try a regex solution for performance issues as i need to deal with very huge files (and also for the sake of knowing about it)
I looked at several links, including Regex to extract first 3 words from a string and http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html
so i tried things like
gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"
i tried other stuff but it usually returned me either the whole string, or the empty string.
another problem with substr is that it returns a list. maybe it looks like the [[]]
operator is slowing things a bit (??) when dealing with large files and doing apply stuff.
it looks like the Syntax used in R is somewhat different ? thanks !