
I need a way to use alternation ('or') between whole words outside of capture groups in tidyr::extract, as in the following example.

Suppose I have the following strings:

string1 <- data.frame(col = "asdnajksdnthingAasdnaksjdnajksn")
string2 <- data.frame(col = "asdnajksdnwordBasdnaksjdnajksn")

I want to use tidyr::extract() to extract 'A' and 'B' with the same regular expression, but I don't want to extract 'word' or 'thing'. The desired output would be:

string1 %>% extract(col = 'col', regex = regex, into = "NewColumn")
> NewColumn
  "A"

string2 %>% extract(col = 'col', regex = regex, into = "NewColumn")
> NewColumn
  "B"

The answer would be something like this:

extract(string, col = "col", into = "NewColumn",
        regex = "(word)|(thing)(.)")

But I can't do that, because it would result in:

NewColumn NA
word      A

I know that in this example I could just use something like

"[tw][ho][ir][nd]g?(.)"

but I'm looking for a more general solution.

Nícolas Pinto

1 Answer


Since tidyr extract() extracts the capturing group values, you can group the alternatives that you do not want to extract with a non-capturing group.

The syntax of a non-capturing group is (?:...):

If you do not need the group to capture its match, you can optimize this regular expression into Set(?:Value)?. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group. The question mark after the opening bracket is unrelated to the question mark at the end of the regex.
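To make the difference concrete, here is a minimal base-R sketch that runs the capturing and non-capturing versions of the pattern against the first sample string. The variable names `x`, `m_cap` and `m_noncap` are only illustrative, and `perl = TRUE` is passed so that the `(?:...)` syntax is handled by the PCRE engine.

```r
x <- "asdnajksdnthingAasdnaksjdnajksn"

# Capturing group: the alternation itself becomes group 1,
# so the result has the full match plus two captured groups.
m_cap <- regmatches(x, regexec("(word|thing)(.)", x, perl = TRUE))[[1]]

# Non-capturing group: the alternation is matched but not captured,
# so the result has the full match plus one captured group.
m_noncap <- regmatches(x, regexec("(?:word|thing)(.)", x, perl = TRUE))[[1]]

print(m_cap)     # full match, "thing", "A"
print(m_noncap)  # full match, "A"
```

Since extract() creates one column per capturing group, only the second form leaves a single value to put into the new column.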

So, use something like:

> library(tidyr)
> string1 <- data.frame (col = "asdnajksdnthingAasdnaksjdnajksn")
> string1 %>% extract(col, c("NewColumn"), "(?:word|thing)(.)")
  NewColumn
1         A
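The same pattern should cover both cases from the question at once. The sketch below assumes the second string contains "wordB" (inferred from the desired output "B") and puts both strings into one hypothetical data frame called `strings`.

```r
library(tidyr)

# Both sample strings in one data frame; the second value is an
# assumption based on the desired output in the question.
strings <- data.frame(
  col = c("asdnajksdnthingAasdnaksjdnajksn",
          "asdnajksdnwordBasdnaksjdnajksn")
)

# (?:word|thing) matches either word without capturing it;
# (.) captures the single character that follows.
result <- tidyr::extract(strings, col, into = "NewColumn",
                         regex = "(?:word|thing)(.)")

print(result$NewColumn)  # "A" "B"
```

More alternatives can be added inside the non-capturing group, e.g. (?:word|thing|item), without changing the number of output columns.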
Wiktor Stribiżew