I guess this is a common problem. I found quite a lot of web pages about it, including some on SO, but I failed to understand how to implement it.

I am new to regex, and I'd like to use it in R to extract the first few words from a sentence.

For example, if my sentence is

z = "I love stack overflow it is such a cool site"

I'd like my output to be (if I need the first four words)

[1] "I love stack overflow"

or (if I need the last four words)

[1] "such a cool site"

Of course, the following works:

paste(strsplit(z," ")[[1]][1:4],collapse=" ")
paste(strsplit(z," ")[[1]][7:10],collapse=" ")

But I'd like to try a regex solution for performance reasons, as I need to deal with very large files (and also for the sake of learning about it).

I looked at several links, including the SO question "Regex to extract first 3 words from a string" and http://osherove.com/blog/2005/1/7/using-regex-to-return-the-first-n-words-in-a-string.html

So I tried things like

gsub("^((?:\S+\s+){2}\S+).*",z,perl=TRUE)
Error: '\S' is an unrecognized escape in character string starting ""^((?:\S"

I tried other things, but they usually returned either the whole string or the empty string.

Another problem with strsplit is that it returns a list. It looks like the [[]] operator may be slowing things down a bit (?) when dealing with large files and using the apply functions.

It looks like the syntax used in R is somewhat different? Thanks!

Fagui Curtain
  • You need to use double escapes in R regex. `\S` -> `\\S` – Wiktor Stribiżew Nov 22 '15 at 14:57
  • You could also try `stringi::stri_extract_all_words(z)[[1]][1:4]`, which is easier to use and doesn't require knowing regex, though you will get the words as separate values. – David Arenburg Nov 22 '15 at 15:11
  • Couldn't you just use the same idea that I had shared [in your earlier question](http://stackoverflow.com/questions/33785594/manipulate-char-vectors-inside-a-data-table-object-in-r)? You just need to double up on your backslashes in R, as already pointed out by @stribizhev. – A5C1D2H2I1M1N2O1R2T1 Nov 22 '15 at 15:25
  • Yes @Ananda Mahto, sorry, I am slow to learn. Now I understand that I need the double backslash. – Fagui Curtain Nov 22 '15 at 15:26

2 Answers

You've already accepted an answer, but I'm going to share this as a means of helping you understand a little more about regex in R, since you were actually very close to getting the answer on your own.


There are two problems with your gsub approach:

  1. You used single backslashes (\). In an R string literal the backslash is itself the escape character, so you have to escape it by adding another backslash (\\). The doubled backslash is still a single character in the resulting string: if you do nchar("\\"), you'll see that it returns 1.

  2. You didn't specify what the replacement should be. Here, we don't want to replace anything, but we want to capture a specific part of the string. You capture groups in parentheses (...), and then you can refer to them by the number of the group. Here, we have just one group, so we refer to it as "\\1".
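Both points can be checked directly in the console. This is only a minimal sketch using base R; the variable name `pat` and the one-word pattern are mine, chosen for illustration.

```r
z <- "I love stack overflow it is such a cool site"

# "\\S+" in R source code is a three-character string: \, S and +.
pat <- "\\S+"
nchar(pat)       # 3
writeLines(pat)  # shows what the regex engine actually receives: \S+

# sub() needs a replacement argument; "\\1" refers back to the
# first parenthesised group in the pattern.
sub("^(\\S+).*", "\\1", z, perl = TRUE)
# [1] "I"
```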

You should have tried something like:

sub("^((?:\\S+\\s+){2}\\S+).*", "\\1", z, perl = TRUE)
# [1] "I love stack"

This is essentially saying:

  • Work from the start of the contents of "z".
  • Start creating group 1.
  • Find non-whitespace (like a word) followed by whitespace (\S+\s+) two times {2} and then the next set of non-whitespaces (\S+). This will get us 3 words, without also getting the whitespace after the third word. Thus, if you wanted a different number of words, change the {2} to be one less than the number you are actually after.
  • End group 1 there.
  • Then, just return the contents of group 1 (\1) from "z".
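If the number of words varies, the pattern can be built with `sprintf()`. The helper below is a hypothetical wrapper (the name `first_n_words` is not from base R or from this answer), sketched under the assumption that words are separated by whitespace.

```r
# Hypothetical helper: keep the first n whitespace-separated words.
first_n_words <- function(x, n) {
  stopifnot(n >= 1)
  # "word + whitespace" repeated n - 1 times, then one final word.
  pattern <- sprintf("^((?:\\S+\\s+){%d}\\S+).*", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
first_n_words(z, 4)
# [1] "I love stack overflow"

# sub() is vectorised and returns a plain character vector, not a list.
# Strings with fewer than n words do not match and come back unchanged.
first_n_words(c(z, "one two"), 4)
# [1] "I love stack overflow" "one two"
```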

To get the last three words, just switch the position of the capturing group and put it at the end of the pattern to match.

sub("^.*\\s+((?:\\S+\\s+){2}\\S+)$", "\\1", z, perl = TRUE)
# [1] "a cool site"
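The same idea can be wrapped for an arbitrary number of trailing words. Again, `last_n_words` is a hypothetical name, and the sketch assumes the strings have no trailing whitespace (otherwise `$` will not line up with the last word; `trimws()` could be applied first).

```r
# Hypothetical helper: keep the last n whitespace-separated words.
last_n_words <- function(x, n) {
  stopifnot(n >= 1)
  # The "\\s+" before the group forces the capture to start at a word
  # boundary, so the greedy ".*" cannot swallow part of the first word.
  pattern <- sprintf("^.*\\s+((?:\\S+\\s+){%d}\\S+)$", n - 1L)
  sub(pattern, "\\1", x, perl = TRUE)
}

z <- "I love stack overflow it is such a cool site"
last_n_words(z, 4)
# [1] "such a cool site"
```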
A5C1D2H2I1M1N2O1R2T1
  • Thanks, @Ananda Mahto. Could you give the regex for the last 4 words using the same function `sub`? – Fagui Curtain Nov 23 '15 at 00:33
  • @FaguiCurtain, I just swapped the reference from being fixed to the start of the line to the end instead, like: `^.*\\s+((?:\\S+\\s+){2}\\S+)$`. Change "2" to "3" to get 4 words instead of 3. – A5C1D2H2I1M1N2O1R2T1 Nov 23 '15 at 02:36

To get the first four words:

library(stringr)
str_extract(z, "^\\s*(?:\\S+\\s+){3}\\S+")

To get the last four:

str_extract(z, "(?:\\S+\\s+){3}\\S+(?=\\s*$)")
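For comparison, the same two patterns can be used without stringr. The sketch below relies on base R's `regmatches()` and `regexpr()`, which extract the matched text instead of replacing it; `perl = TRUE` is needed for the lookahead. The names `first4` and `last4` are only illustrative.

```r
z <- "I love stack overflow it is such a cool site"

# regexpr() locates the first match; regmatches() pulls it out.
first4 <- regmatches(z, regexpr("^\\s*(?:\\S+\\s+){3}\\S+", z, perl = TRUE))
last4  <- regmatches(z, regexpr("(?:\\S+\\s+){3}\\S+(?=\\s*$)", z, perl = TRUE))

first4
# [1] "I love stack overflow"
last4
# [1] "such a cool site"
```

One difference to keep in mind: on a vector, `regmatches()` silently drops elements that have no match, whereas `str_extract()` returns `NA` for them, so the result can be shorter than the input.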
Avinash Raj
  • Or `sub("^\\s*((?:\\S+\\s+){3}\\S+).*", "\\1", z)` – Avinash Raj Nov 22 '15 at 15:01
  • Can you give me the correct regex using the function `sub`? I ran a test on a sample of 10,000 and the `sub` function from base R is about 30 times faster than `str_extract` from `library(stringr)`. Thanks – Fagui Curtain Nov 23 '15 at 00:32
  • I don't know how to tweak the function. `sub("(?:\\S+\\s+){3}\\S+(?=\\s*$)",replacement="",z,perl=TRUE)` returns `"I love stack overflow it is "`, which is everything BUT the last 4 words... – Fagui Curtain Nov 23 '15 at 01:10
  • `sub('^.* (\\w+\\s+\\w+\\s+\\w+\\s+\\w+)$', '\\1', z)` works for the last 4 words, but I don't understand how to properly use the `{...}` to make a simpler expression in this case – Fagui Curtain Nov 23 '15 at 01:41
  • Like `sub('^.* (\\w+(?:\\s+\\w+){3})$', '\\1', z)` – Avinash Raj Nov 23 '15 at 02:11