2

I'm struggling to write a regex that matches the last triple underscore in a string that is preceded by a letter or number. Eventually, I want to extract the characters before and after this match. I also need to do this in base R.

x <- c("three___thenfour____1", 
             "only_three___k")

The closest I've gotten is adapting the answer to "Regex Last occurrence?":

sub("^(.+)___(?:.(?!___))+$", "\\1", x, perl = TRUE)
[1] "three___thenfour_" "only_three" 

But what I really want to be able to get is

c("three___thenfour", "only_three") and c("_1", "k")

(The only way I've managed to get those results so far is through strsplit, but it feels clunky and inefficient)

do.call("rbind", 
        lapply(strsplit(x, "___"), 
               function(x){ 
                 c(paste0(head(x, -1), collapse = "___"), tail(x, 1))
               }))

     [,1]               [,2]
[1,] "three___thenfour" "_1"
[2,] "only_three"       "k" 

Any suggestions?

Benjamin
  • Did you try the `char{min,max}` form? That is `_{2,}` (at least 2 `_`) – Ted Lyngmo Apr 15 '23 at 03:23
  • [`^(.*?)___(?!.*___)(.*)`](https://regex101.com/r/X5bXi8/1), perhaps? – InSync Apr 15 '23 at 03:59
  • What should happen if `___` is preceded by something other than a letter or digit? For example, what should be matched in `a___b:___c`? – markalex Apr 15 '23 at 06:50
  • @markalex, well nuts. The system that is sending me these strings automatically converts all non-alphanumerics to an underscore. Which means `a___b:___c` is going to get sent to me as `a___b____c` and I will want it split as `a___b____` and `c`. I might have to reconsider this strategy entirely. (Thanks for pointing that out, by the way) – Benjamin Apr 15 '23 at 11:16
  • Actually, I'm luckier than that. The system feeding me the strings is concatenating two fields. The pattern is `[first]___[second]`. Fortunately, the system disallows anything but alphanumeric in the last position of `[first]`. – Benjamin Apr 15 '23 at 11:31
  • More ideas: [`(.*)(?<!_)___(.*)`](https://regex101.com/r/AndQW8/1) – bobble bubble Apr 15 '23 at 11:54
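
Following the clarification in the comments (the separator is the `___` that directly follows the alphanumeric end of `[first]`), here is a minimal base R sketch that applies the pattern from the last comment with `regexec()` and `regmatches()`. The names `pattern`, `m`, and `parts` are illustrative and not from the original post, and the `^`/`$` anchors are an addition.

```r
x <- c("three___thenfour____1", "only_three___k")

# Greedy first group, then a triple underscore that is NOT preceded by "_",
# then everything that is left. The lookbehind needs perl = TRUE.
pattern <- "^(.*)(?<!_)___(.*)$"

# For each string: element 1 is the full match, 2 and 3 are the capture groups.
m <- regmatches(x, regexec(pattern, x, perl = TRUE))

# Keep only the two groups and stack them into a two-column matrix.
parts <- do.call(rbind, lapply(m, `[`, 2:3))

parts[, 1]
# [1] "three___thenfour" "only_three"
parts[, 2]
# [1] "_1" "k"
```

A string without a matching separator yields a row of `NA` values here, which may or may not be the behaviour you want.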

3 Answers

2

You can try this

strsplit(x, '(?<!_)_{3}(?!.*(?<!_)_{3})', perl=TRUE)
# [[1]]
# [1] "three___thenfour" "_1"              
# 
# [[2]]
# [1] "only_three" "k" 

and finally to get the vectors

strsplit(x, '(?<!_)_{3}(?!.*(?<!_)_{3})', perl=TRUE) |>
  as.data.frame() |> unname() |> asplit(1)
# [[1]]
# [1] "three___thenfour" "only_three"      
# 
# [[2]]
# [1] "_1" "k" 
jay.sf
2

This matches your current requirements:

x <- c(
  "three___thenfour____1", 
  "only_three___k",
  "test___test___test___test",
  "1_____test"
)
             
gsub("^(.*?)___(?!.*___)(.*)$", "\\1 \\2", x, perl = TRUE)

It outputs:

[1] "three___thenfour _1"     "only_three k"           
[3] "test___test___test test" "1 __test"               

Explanation:

  • ^(.*?)___ - from the start of the string, capture as few characters as possible into the 1st group, followed by ___
  • (?!.*___) - negative lookahead: assert that no further ___ occurs anywhere later in the string, so the ___ just matched is the last one
  • (.*)$ - capture everything after that, up to the end of the string, into the 2nd group
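
To get the two parts as separate vectors rather than one joined string, the same pattern can be used with two `sub()` calls, one per capture group. This is a minimal sketch; the names `pattern`, `first`, and `second` are illustrative and not part of the answer.

```r
x <- c(
  "three___thenfour____1",
  "only_three___k",
  "test___test___test___test",
  "1_____test"
)

# Same pattern as in the answer above.
pattern <- "^(.*?)___(?!.*___)(.*)$"

# Replace the whole string with group 1 (before) or group 2 (after).
first  <- sub(pattern, "\\1", x, perl = TRUE)
second <- sub(pattern, "\\2", x, perl = TRUE)

first
# [1] "three___thenfour"   "only_three"         "test___test___test" "1"
second
# [1] "_1"     "k"      "test"   "__test"
```

Note that this pattern does not itself check that the ___ is preceded by a letter or number: in a run of five underscores such as "1_____test" the split lands on the first three, as the last element shows.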
Destroy666
  • A couple of questions, if I may: 1. Why do you need the lazy modifier inside the negative lookahead? 2. Why do you need the lazy modifier for the second group, if you are capturing everything until the end? 3. The wording "preceded by anything after" seems a little confusing, even if I know what the regex is supposed to do. – markalex Apr 15 '23 at 06:47
  • Good points, corrected. For the other non-greedy ones it was just copy-paste that I forgot to change in the end. Both work here, but I think greedy matches are faster in this case? Not sure, actually; normally there should be less backtracking with lazy ones, no? – Destroy666 Apr 15 '23 at 07:37
  • I believe a greedy second group should be faster (unless optimized by the engine). The lookahead should be the same in performance, but it just looked weird, so I decided to ask whether you had some intention that I don't understand or whether it was unintentional. – markalex Apr 15 '23 at 08:34
  • I'm accepting this answer in large part because it fits well into my current work flow without having to refactor much of anything. Thank you for your time. – Benjamin Apr 15 '23 at 11:21
2

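You can use regexpr with .*[^_]___, which matches up to the last ___ that directly follows a non-underscore, i.e. the leftmost three underscores of the final run. A plain .*___ would instead end at the rightmost three underscores of that run. Extract the first part with regmatches and the last part with substring.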

i <- regexpr(".*[^_]___", x)
sub("___$", "", regmatches(x, i))
#[1] "three___thenfour" "only_three"

substring(x, attr(i, "match.length")+1L)
#[1] "_1" "k"
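
To make the difference between the two patterns concrete, this small sketch compares their match lengths on the question's data. The names `len_left` and `len_right` are illustrative and not from the answer.

```r
x <- c("three___thenfour____1", "only_three___k")

# ".*[^_]___" must end right after a non-underscore plus "___",
# so in a run of four underscores it stops after the first three.
len_left <- attr(regexpr(".*[^_]___", x), "match.length")

# ".*___" simply extends as far right as possible.
len_right <- attr(regexpr(".*___", x), "match.length")

len_left
# [1] 19 13
len_right
# [1] 20 13

# With ".*___" the leading underscore of the second field is lost.
substring(x, len_right + 1L)
# [1] "1" "k"
```

The two patterns only differ when the final run has more than three underscores, which is exactly the "three___thenfour____1" case.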
GKi