I have a basic data frame (printed below).

I want to search a column for a SKU (8 digits), capture it with a capturing group, and put it in a new column: "SKU_solo".

I don't want the literal "\1", but the first 8-digit number. How do I use the capturing group in my code?

This is my code:

I'm using "dplyr"

urls_na <- urls_na %>%
           mutate(SKU_solo = NA, #initialize the new column
                  SKU_solo = ifelse(grepl("([0-9]+)", Page), "\\1",SKU_solo))




                     Page                   Categoria   Page.Views       SKU_solo
1   5   /Cajon_Criolla_20141024                 #N/A             7           \1 
2   6   /Linon_20141115_20141130                #N/A           564           \1
3   7   /Cat/LIQUID                             #N/A             1           NA
4   8   /c_puertas_20141106_20141107            #N/A            34           \1 
5   9   /C_Puertas_3_20141017_20141018          #N/A             2           \1
6   10  /c_puertas_navidad_20141204_20141205    #N/A        187319           \1

Desired output:

                     Page                   Categoria   Page.Views       SKU_solo
1   5   /Cajon_Criolla_20141024                 #N/A             7       20141024
2   6   /Linon_20141115_20141130                #N/A           564       20141115
3   7   /Cat/LIQUID                             #N/A             1           NA
4   8   /c_puertas_20141106_20141107            #N/A            34       20141106
5   9   /C_Puertas_3_20141017_20141018          #N/A             2       20141017
6   10  /c_puertas_navidad_20141204_20141205    #N/A        187319       20141204 

NOTES:

1) ifelse and grepl do the matching and replacement. However, they just return \1 as a literal string.

2) There may be other numbers, as in row 5, but the important one is the first SKU (the first group of 8 digits).

UPDATE:

As you can see, I only get "\1" printed in the SKU_solo column. I know there are other ways of doing this, but what is wrong with my code?

I want to use the "capturing group" feature of regex. I've read that it numbers the groups 1, 2, ... from left to right whenever something is within "()". In my code: ifelse(grepl("([0-9]+)", Page), "\\1",SKU_solo)) ... ([0-9]+) should be assigned number 1, which is why I then use "\1" to refer to it. I don't understand why it does not work and only puts "\1" in the "SKU_solo" column.

Omar Gonzales

3 Answers

There are several problems in your code. First, you don't specify how many digits to match. Second, you don't make the leading part of the pattern non-greedy (lazy) so that the match stops at the first 8-digit group; that is what (.*?) does.

To fix your problem 2), use the regular expression

     "(.*?)_([0-9]{8})"

But then your "capturing group" idea does not work, because backreferences such as "\\1" are only interpreted inside the replacement argument of functions like sub or gsub. You cannot pass a captured group from the test argument to the yes argument of ifelse(), so if you want to keep your construct you have to apply the regular expression twice: once in grepl() and once in sub().

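    # Group 2 is the first run of 8 digits that follows an underscore
    matchingExp <- "(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$"
    urls_na <- urls_na %>% 
                 mutate(SKU_solo=NA,
                 # keep only group 2 when the pattern matches, NA otherwise
                 SKU_solo=ifelse(grepl(matchingExp,Page),sub(matchingExp,"\\2",Page),NA))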

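But this is inefficient, because the regular expression is evaluated twice. To avoid that, you can rely on two facts: sub() returns its input unchanged when the pattern does not match, and as.numeric() turns a string that is not purely numeric into NA (with a warning). Since your result has to be numeric and a non-matching file name cannot be purely numeric (you can always prepend an "a" if in doubt), a single call is enough: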

    urls_na <- urls_na %>% mutate(SKU_solo=as.numeric(sub("(.*?)_([0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])(.*)$","\\2",Page)))

All of the above works fine here.
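
As a quick check of the single-call idea, here is a minimal sketch run on the six Page values from the question. The vector name `pages` is made up for illustration, and `perl = TRUE` is an assumption of this sketch (not part of the code above), chosen so that the lazy `(.*?)` and the `{8}` quantifier are handled by the PCRE engine.

```r
# Hypothetical stand-in for urls_na$Page, copied from the question
pages <- c("/Cajon_Criolla_20141024",
           "/Linon_20141115_20141130",
           "/Cat/LIQUID",
           "/c_puertas_20141106_20141107",
           "/C_Puertas_3_20141017_20141018",
           "/c_puertas_navidad_20141204_20141205")

pattern <- "(.*?)_([0-9]{8})(.*)$"

# sub() swaps the whole match for group 2; non-matching strings come back unchanged
raw <- sub(pattern, "\\2", pages, perl = TRUE)

# Non-numeric leftovers such as "/Cat/LIQUID" become NA; the coercion warning is silenced
sku <- suppressWarnings(as.numeric(raw))

print(sku)
```

The third element should come out as NA and the others as the first 8-digit group of each path. Note that suppressWarnings() also hides any other coercion warning, so it may be safer to leave it out while debugging.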

@dave: note that I actually see different behaviour between [0-9] repeated n times and [0-9]{n}. I posted a question here.

cmbarbu

You can use the stringr package for this:

library(stringr)
urls_na <- urls_na %>%
           mutate(SKU_solo = NA, #initialize the new column
                  SKU_solo = str_match(Page, "([0-9]{8})")[,1])

Note that I changed your regex too, since you are looking for an 8 digit number.

Note that this:

str_match(Page, "([0-9]{8})")[,1]

returns the complete match. To get individual capture groups instead, use column indices 2 onward.

From the stringr docs:

Value:

     character matrix. First column is the complete match, followed by
     one for each capture group
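
To make the column indexing concrete, here is a minimal sketch with two capture groups, so that both the first and the second 8-digit number can be picked out. The vector `pages` and the two-group pattern are assumptions made for illustration; they are not taken from the answer above.

```r
library(stringr)

# Hypothetical sample of Page values from the question
pages <- c("/Cajon_Criolla_20141024",
           "/Linon_20141115_20141130",
           "/Cat/LIQUID")

# Group 1: first 8-digit run after an underscore
# Group 2: an optional second 8-digit run directly after it
m <- str_match(pages, "_([0-9]{8})(?:_([0-9]{8}))?")

first_sku  <- m[, 2]   # column 1 is the whole match, column 2 is group 1
second_sku <- m[, 3]   # column 3 is group 2, NA when it did not take part

print(first_sku)
print(second_sku)
```

Rows without any match, such as "/Cat/LIQUID", come back as NA in every column, so no separate grepl() test is needed.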
dave
  • Thanks for this. I'm investigating the stringr package. However, I need to group them because, as you see, I have 2 groups of 8-digit numbers in some lines. Later I want to use the second group of 8 digits, so in my code I would just need to reference the second group. – Omar Gonzales Feb 25 '15 at 17:31
  • `str_match` gives that to you. From the doc: `First column is the complete match, followed by one for each capture group`. So you should be able to index in to get whatever group you want. – dave Feb 25 '15 at 17:33
  • From your code, I see: 1) look in the Page column, 2) look for a group of 8 digits. However, I don't see the grouping part, which I will need in the future. This is very important. Please add some comments to your answer; maybe I'm wrong. – Omar Gonzales Feb 25 '15 at 17:39

dplyr:

urls_na <- urls_na %>%
           mutate( SKU_solo = ifelse(grepl('_([0-9]{8})$',Page), 
                                    gsub('^.*_(\\d{8})$','\\1',Page),
                                    NA))

base R:

urls_na$SKU_solo <- ifelse(grepl('_([0-9]{8})$',urls_na$Page), 
                           gsub('^.*_(\\d{8})$','\\1',urls_na$Page),
                           NA)
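
The difference between a literal "\\1" and a backreference can be seen in a small sketch. The vector `pages` and the result names are made up for illustration, and `perl = TRUE` in the middle variant is an assumption of this sketch rather than part of the code above.

```r
# Hypothetical sample of Page values from the question
pages <- c("/Cajon_Criolla_20141024",
           "/Linon_20141115_20141130",
           "/Cat/LIQUID")

# Here "\\1" is an ordinary two-character string: grepl() only returns
# TRUE/FALSE and keeps no record of what the group captured.
literal <- ifelse(grepl("([0-9]+)", pages), "\\1", NA)

# Inside sub(), "\\1" stands for the text captured by group 1.
# A lazy prefix stops at the first 8-digit group.
first_pattern <- "^.*?_([0-9]{8}).*$"
first_sku <- ifelse(grepl(first_pattern, pages, perl = TRUE),
                    sub(first_pattern, "\\1", pages, perl = TRUE),
                    NA)

# A pattern anchored with $ selects the 8-digit group at the end of the string.
last_sku <- ifelse(grepl("_[0-9]{8}$", pages),
                   sub("^.*_([0-9]{8})$", "\\1", pages),
                   NA)

print(literal)
print(first_sku)
print(last_sku)
```

For paths with two dates, such as "/Linon_20141115_20141130", the anchored variant yields the trailing number while the lazy variant yields the leading one, so the choice depends on which SKU is wanted.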
Jthorpe
  • Your second argument to `ifelse( , '\\1', )` is simply a literal string, whereas the second argument *should* evaluate to whatever you want when the first argument evaluates to `TRUE`. In my solutions, the second argument is the call to `substr(Page,...)`, which evaluates to the desired SKU. – Jthorpe Feb 25 '15 at 19:08
  • Note also that you don't need to initialize your column; `ifelse()` is vectorized so you can just call `mutate(SKU_solo = ifelse(grepl("([0-9]+)", Page), SomethingThatEvalutesToTheSKU,NA))` – Jthorpe Feb 25 '15 at 19:13
  • I get what you say. That is why I want to use the "capturing group" feature of regex. I've read that it numbers the groups 1, 2, ... from left to right whenever something is within "()". In my code: `ifelse(grepl("([0-9]+)", Page), "\\1",SKU_solo))` ... `([0-9]+)` should be assigned number 1, which is why I then use "\\1" to refer to it. I don't understand why it does not work. – Omar Gonzales Feb 25 '15 at 19:22
  • You're correct about `"\\1"` referring to the first captured group, but it is only interpreted in the replacement argument of `sub()` and `gsub()`. – Jthorpe Feb 25 '15 at 21:43
  • So my code does not work because I need to use `sub()` instead of `grepl`? Thanks. – Omar Gonzales Feb 25 '15 at 21:57
  • You need to use `grepl` to determine if there is a match and if you want to use the `"\\1"` pattern, then you can use `gsub()` in the second argument of `ifelse()`. See edits above. – Jthorpe Feb 25 '15 at 23:02