
I'm putting real time into learning regex and I'm playing with different toy scenarios. One setup I can't get to work is grabbing everything from the beginning of a string up to the nth occurrence of a character, where n > 1.

Here I can grab from the beginning of the string to the first underscore but I can't generalize this to the second or third underscore.

x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")

gsub("_.*$", "", x)

Here's what I'm trying to achieve with regex (`sub`/`gsub`):

## > sapply(lapply(strsplit(x, "_"), "[", 1:2), paste, collapse="_")
## [1] "a_b" "1_2" "<_?"

#or

## > sapply(lapply(strsplit(x, "_"), "[", 1:3), paste, collapse="_")
## [1] "a_b_c" "1_2_3" "<_?_."

Related post: regex from first character to the end of the string

Tyler Rinker

5 Answers

5

Here's a start. To make this safe for general use, you'll need it to properly escape regular expressions' special characters:

x <- c("a_b_c_d", "1_2_3_4", "<_?_._:", "", "abcd", "____abcd")

matchToNth <- function(char, n) {
    others <- paste0("[^", char, "]*") ## matches "[^_]*" if char is "_"
    mainPat <- paste0(c(rep(c(others, char), n-1), others), collapse="")
    paste0("(^", mainPat, ")", "(.*$)")
}

gsub(matchToNth("_", 2), "\\1", x)
# [1] "a_b"  "1_2"  "<_?"  ""     "abcd" "_" 

gsub(matchToNth("_", 3), "\\1", x)
# [1] "a_b_c" "1_2_3" "<_?_." ""      "abcd"  "__"   
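
As a rough sketch of the escaping step mentioned above: the variant below backslash-escapes any non-alphanumeric delimiter before building the pattern. The name `matchToNthEscaped` is hypothetical (it is not part of the original answer), and the sketch assumes a single ASCII delimiter character and `perl = TRUE`, since PCRE treats a backslash followed by a non-alphanumeric character as that literal character, both inside and outside a character class.

```r
## Hypothetical variant of matchToNth() that escapes the delimiter.
## Assumes a single ASCII character and that the pattern is used with perl = TRUE.
matchToNthEscaped <- function(char, n) {
    stopifnot(is.character(char), nchar(char) == 1, n >= 1)
    ## letters and digits are never special; anything else gets a backslash
    esc <- if (grepl("^[A-Za-z0-9]$", char)) char else paste0("\\", char)
    others <- paste0("[^", esc, "]*")   ## e.g. "[^\\.]*" if char is "."
    mainPat <- paste0(c(rep(c(others, esc), n - 1), others), collapse = "")
    paste0("^(", mainPat, ").*$")
}

y <- c("a.b.c.d", "1^2^3^4", "no-dots")

sub(matchToNthEscaped(".", 2), "\\1", y, perl = TRUE)
# [1] "a.b"     "1^2^3^4" "no-dots"

sub(matchToNthEscaped("^", 2), "\\1", y, perl = TRUE)
# [1] "a.b.c.d" "1^2"     "no-dots"
```

Strings that contain no delimiter at all are returned unchanged, matching the behaviour of the `"abcd"` element above.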
Josh O'Brien
  • I was actually working on something similar. My approach is close but not the same as yours. Good call on the safe approach; I use this in qdap: `library(qdap); genX` – Tyler Rinker Apr 09 '13 at 19:22
  • @TylerRinker -- Just be aware that the answer you have accepted doesn't work for strings like those in the following: `x <- c("_a_b", "a__b")`. – Josh O'Brien Apr 09 '13 at 19:26
3

How about:

gsub('^(.+_.+?).*$', '\\1', x)
# [1] "a_b" "1_2" "<_?"

Alternatively you can use {} to indicate the number of repeats...

sub('((.+_){1}.+?).*$', '\\1', x)  # {0} will give "a", {1} - "a_b", {2} - "a_b_c" and so on

So you don't have to repeat yourself if you wanted to match the nth one...

Fabrício Matté
Justin
  • second regex is missing a `.+?` and should be `sub('((.+_){2}.+?).*$', '\\1', x)` – eddi Apr 09 '13 at 21:26
  • @eddi the second regex wasn't missing it, because I didn't know how to make it work properly! And I think it would be closer with my edit. Thanks for pointing me in the right direction. – Justin Apr 09 '13 at 21:39
  • without the `.+` before `?` you'll get an extra `_` at the end which doesn't seem to match with OP's examples – eddi Apr 09 '13 at 22:05
  • True, but you also get 3 characters :) That second one was just a shot in the dark and I knew it wasn't working. Feel free to fix it! – Justin Apr 09 '13 at 22:40
  • @eddi Your suggested edit has been rejected (+2-3) due to the suggested edits review system's lack of context, hence I've re-edited the answer directly applying your edit. If interested, you can see my ongoing thread on [Meta](http://meta.stackexchange.com/q/175758/186879) for a suggested edits review queue improvement. – Fabrício Matté Apr 09 '13 at 23:09
  • I like this idea (and have tried unsuccessfully to fix it), but it fails pretty badly on something like `x <- "_a__b"`, for which it returns a string containing three rather than one `"_"`'s. – Josh O'Brien Apr 10 '13 at 05:25
  • @Justin I have used this solution in a function in the qdap package. Could you email me your last name (see qdap for my email) so I can properly attribute credit? – Tyler Rinker Jun 19 '13 at 12:56
1

Up to the second underscore, in Perl-style regex:

/^(.*?_.*?_)/

and third:

/^(.*?_.*?_.*?_)/
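
For anyone wanting to try the Perl-style idea from R, here is a minimal sketch, assuming `perl = TRUE` so that the lazy `.*?` is handled by PCRE. The variable names `through2` and `upto2` are only for illustration. Note that the pattern as written includes the final underscore in the match, so a second variant is shown that leaves it outside the capture group.

```r
x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")

## Match from the start through the second underscore (underscore included)
through2 <- regmatches(x, regexpr("^.*?_.*?_", x, perl = TRUE))
through2
# [1] "a_b_" "1_2_" "<_?_"

## Keep the second underscore outside the capture group to drop it
upto2 <- sub("^(.*?_.*?)_.*$", "\\1", x, perl = TRUE)
upto2
# [1] "a_b" "1_2" "<_?"
```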
ennuikiller
1

Maybe something like this

x
## [1] "a_b_c_d" "1_2_3_4" "<_?_._:"

gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){1}", x)))
## [1] "a" "1" "<"

gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){2}", x)))
## [1] "a_b" "1_2" "<_?"

gsub("(.*)_", "\\1", regmatches(x, regexpr("([^_]*_){3}", x)))
## [1] "a_b_c" "1_2_3" "<_?_."
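
One thing to watch with this approach is that `regmatches()` silently drops elements that have no match, so the result can be shorter than `x`. The sketch below wraps the same idea in a function that takes `n` as an argument and leaves non-matching elements unchanged. The name `upToNth` is hypothetical, and the delimiter is hard-coded to `"_"` to sidestep escaping.

```r
## Hypothetical wrapper: everything before the n-th underscore,
## leaving strings with fewer than n underscores untouched.
upToNth <- function(x, n) {
    pat <- sprintf("^([^_]*_){%d}", n)   ## e.g. "^([^_]*_){2}" for n = 2
    m <- regexpr(pat, x)                 ## -1 where there is no match
    out <- x
    ## regmatches() returns only the matched elements, in order
    out[m > 0] <- sub("_$", "", regmatches(x, m))
    out
}

upToNth(c("a_b_c_d", "abcd", "____abcd"), 2)
# [1] "a_b"  "abcd" "_"

upToNth(c("a_b_c_d", "1_2_3_4", "<_?_._:"), 3)
# [1] "a_b_c" "1_2_3" "<_?_."
```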
CHP
1

Using Justin's approach this was what I devised:

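beg2char <- function(text, char = " ", noc = 1, include = FALSE) {
    specchar <- c(".", "|", "(", ")", "[", "{", "^", "$", "*", "+", "?")
    ## escape regex metacharacters before char is used anywhere in the pattern
    if (char %in% specchar) {
        char <- paste0("\\", char)
    }
    ## include = TRUE appends char so the match runs through the noc-th occurrence;
    ## otherwise "?" makes the final ".+" lazy
    inc <- if (include) char else "?"
    ## one "char.+" unit for each occurrence before the noc-th
    ins <- paste(rep(paste0(char, ".+"), noc - 1), collapse = "")
    rep <- paste0("^(.+", ins, inc, ").*$")
    gsub(rep, "\\1", text)
}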

x <- c("a_b_c_d", "1_2_3_4", "<_?_._:")
beg2char(x, "_", 1)
beg2char(x, "_", 2)
beg2char(x, "_", 3)
beg2char(x, "_", 4)
beg2char(x, "_", 3, include=TRUE)
Tyler Rinker
  • Are these results really what you want them to be? `x <- "a____b"; beg2char(x, "_", 2); beg2char(x, "_", 1)` – Josh O'Brien Apr 09 '13 at 20:07
  • @JoshO'Brien I think so, but maybe you see a corner case or something I don't. What are you thinking of specifically? – Tyler Rinker Apr 09 '13 at 20:11
  • Oh, I see from your comment above. Yes for my purposes, but I think that for others your approach is what they'd be after. – Tyler Rinker Apr 09 '13 at 20:14
  • Fair enough. I guess I just don't quite understand what behavior you're actually after, or at least don't have a picture of the domain of inputs on which you want it to work. (Not that I'm asking for further clarification.) Cheers. – Josh O'Brien Apr 09 '13 at 20:20