extract until next "_" if contains

Question

Is there a way to extract part of a string when there is a match, taking everything from the match up to the next underscore ("_")?

From: mycampaign_s22uhd4k_otherinfo I need: s22uhd4k.
From: my_campaign_otherinfo_s22jumpto_otherinfo I need: s22jumpto.

data:

df <- structure(list(a = c("mycampaign_s22uhd4k_otherinfo", "my_campaign_otherinfo_s22jumpto_otherinfo"
), b = c(1, 2)), class = "data.frame", row.names = c(NA, -2L))

Not completely clear what the logic is. In the first example the second element is being extracted, in the other the third. Do they always begin with `s22` or are they always second from the end? — Ritchie Sacramento, Jul 15 '22 at 02:55
It's difficult to work out the 'rule'; is it the second last 'group' that you want to capture? E.g. does this make sense? `gsub(df$a, pattern = ".*_(.*)_.*", replacement = "\\1", perl = TRUE)` (output: [1] "s22uhd4k" "s22jumpto") — jared_mamrot, Jul 15 '22 at 02:55
@RitchieSacramento They don't always begin with `s22`; I wrote it that way for simplicity. The thing is that they won't always be in the same position. – Omar Gonzales, Jul 15 '22 at 02:58

jared_mamrot · Accepted Answer · 2022-07-17T11:59:04.130

Thanks Omar, based on your update/comment, this regex will solve your problem:

df <- structure(list(a = c("mycampaign_s22uhd4k_otherinfo",
                           "my_campaign_otherinfo_s22jumpto_otherinfo",
                           "e220041_pe_mx_aon_aonjulio_conversion_shop_facebook-network_ppl_primaria_s22test512gb_hotsale_20220620"
), b = c(1, 2, 3)), class = "data.frame", row.names = c(NA, -3L))

gsub(df$a, pattern = ".*(s22[^_]+(?=_)).*", replacement = "\\1", perl = TRUE)
#> [1] "s22uhd4k"     "s22jumpto"    "s22test512gb"

Created on 2022-07-17 by the reprex package (v2.0.1)
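One caveat worth keeping in mind: when the pattern does not match, `gsub()` returns the input string unchanged rather than a missing value. The sketch below is one possible way to get `NA` for such rows, using base R's `regexpr()` and `regmatches()`. The third string is a hypothetical row added for illustration, and note that `regexpr()` reports the first match in each string, which is the same token as long as each string contains only one.

```r
x <- c("mycampaign_s22uhd4k_otherinfo",
       "my_campaign_otherinfo_s22jumpto_otherinfo",
       "campaign_without_token")  # hypothetical row with no "s22" token

# Same core pattern as above, without the surrounding ".*"
pattern <- "s22[^_]+(?=_)"

# regexpr() gives the position of the first match per element, or -1 if none
m <- regexpr(pattern, x, perl = TRUE)

# regmatches() returns only the matched substrings, so fill them into
# a vector of NAs at the positions where a match was found
out <- rep(NA_character_, length(x))
out[m != -1] <- regmatches(x, m)

out  # "s22uhd4k", "s22jumpto", NA
```

With the `gsub()` call, the third element would come back as the full original string, which is easy to mistake for a valid result.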

Explanation:

.*(s22[^_]+(?=_)).*

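.* greedily match everything before the capture group (so if "s22" occurs more than once, the last occurrence is the one captured)

(s22 the first capture group starts with the literal "s22"

[^_]+ after "s22", match one or more characters that are not "_"

(?=_) require that the next character is "_", without including it in the match (positive lookahead)

) close the first capture group

.* match all remaining characters

Then, replacement = "\\1" replaces the whole match with the text of the first capture group (the part you want)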
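To see why `[^_]+` matters, here is a small comparison on the long campaign string from this thread. It contrasts the pattern `".*(s22.*)_.*"` with the one in this answer; the variable names `greedy` and `bounded` are only illustrative.

```r
test <- "e220041_pe_mx_aon_aonjulio_conversion_shop_facebook-network_ppl_primaria_s22test512gb_hotsale_20220620"

# "s22.*" is greedy and may cross underscores, so the capture group
# extends up to the LAST "_" in the string
greedy <- gsub(".*(s22.*)_.*", "\\1", test, perl = TRUE)
greedy   # "s22test512gb_hotsale"

# "[^_]+" cannot cross an underscore, so the capture group
# stops right before the NEXT "_" after "s22"
bounded <- gsub(".*(s22[^_]+(?=_)).*", "\\1", test, perl = TRUE)
bounded  # "s22test512gb"
```

The negated character class is what enforces "up to the next underscore"; the lookahead only checks that an underscore actually follows.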

edited Jul 17 '22 at 11:59

answered Jul 15 '22 at 02:59

jared_mamrot


Both work, but the second option is safer as you condition it to start with "s22". How does the first option capture the s22 part? Why does it not capture `otherinfo`? – Omar Gonzales Jul 15 '22 at 03:03
Hi Jared, why won't this work? `test <- "e220041_pe_mx_aon_aonjulio_conversion_shop_facebook-network_ppl_primaria_s22test512gb_hotsale_20220620"` `gsub(test, pattern = ".*(s22.*)_.*", replacement = "\\1", perl = TRUE)` It's returning: `s22test512gb_hotsale` – Omar Gonzales Jul 15 '22 at 15:04
Thanks for the update @OmarGonzales, I have edited my answer and added an explanation for how the regex is capturing the text. If this doesn't work with your actual data, please edit your question and I'll have another look at it. – jared_mamrot Jul 17 '22 at 12:00
