Regex get string between intervals underscores

Question

I've seen a lot of similar questions, but I wasn't able to get the desired output.

I have a string means_variab_textimput_x2_200.txt and I want to catch ONLY what is between the second and third underscores (the third word): textimput

I'm using R with stringr. I've tried many things, but none solved the issue:

my_string <- "means_variab_textimput_x2_200.txt"

str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*')
"means_variab_textimput"

str_extract(my_string, '^(?:([^_]+)_){4}')
"means_variab_textimput_x2_"
str_extract(my_string, '[_]*[^_]*[_]*[^_]*[_]*[^_]*\\.') ## the closest I got was this
"_textimput_x2_200."

Any ideas? PS: I'm VERY new to regex, so details would be much appreciated :)
Additional question: can I also get only a "part" of the word? Let's say, instead of textimput, only text, but without counting the words? It would be good to know both possibilities.
This one, this one, and this one were helpful, but I couldn't get the final expected results. Thanks in advance.

You may use `str_replace`, i.e. `str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")`, which returns `"textimput"` – akrun, Oct 03 '22 at 19:13
Or with base R: `strsplit(my_string, "_")[[1]][3]`, which also returns `"textimput"` – akrun, Oct 03 '22 at 19:14
Using your approach with `str_extract` is a bit troublesome for extracting the third word, because the regex lookbehind `(?<=` may need a fixed length (stringr is based on ICU). Or we could use the perl option in base R, i.e. `regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))`, which returns `"textimput"` – akrun, Oct 03 '22 at 19:24
Thank you very much! PS: is there a way to extract only "part" of the word (additional question)? I got that with `str_sub(str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1"), start = 1, end = 4)`, but is there a more straightforward way? – Larissa Cury, Oct 03 '22 at 19:27
In fact, the idea was to get the whole word; I'm sorry for the confusion (I'm new to the forum, but I'm getting the rhythm). I've put the additional question in a bullet. Is that bad practice? It's good to have both options! Thank you very much! – Larissa Cury, Oct 03 '22 at 19:33

akrun · Accepted Answer · 2022-10-03T19:35:07.933

score 3

stringr uses ICU-based regular expressions. One option would be a regex lookbehind, but the prefix to skip here does not have a fixed length, so (?<= wouldn't work. Another option is to either remove the surrounding substrings with str_remove, or use str_replace to match the whole string, capture the third word (one or more characters that are not _, i.e. ([^_]+)), and replace the match with the backreference (\\1) to the captured word:

library(stringr)
my_string <- "means_variab_textimput_x2_200.txt"
str_replace(my_string, "^[^_]+_[^_]+_([^_]+)_.*", "\\1")
# [1] "textimput"

If we need only the first four characters of that word:

str_replace(my_string, "^[^_]+_[^_]+_([^_]{4}).*", "\\1")
# [1] "text"

In base R, it is easier to split the string with strsplit and pick the third word by indexing:

strsplit(my_string, "_")[[1]][3]
# [1] "textimput"

Or use perl = TRUE in regexpr, where \\K drops everything matched so far from the result:

regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]+", my_string, perl = TRUE))
# [1] "textimput"

For the four-character substring:

regmatches(my_string, regexpr("^([^_]+_){2}\\K[^_]{4}", my_string, perl = TRUE))
# [1] "text"
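
The split-and-index idea above generalizes to any field position. A minimal base R sketch, assuming a hypothetical helper named `nth_field` (it is not part of the original answer or of any package):

```r
# Hypothetical helper: return the n-th field of each string in x, split on sep.
# Strings with fewer than n fields yield NA instead of an error.
nth_field <- function(x, n, sep = "_") {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts,
         function(p) if (length(p) >= n) p[n] else NA_character_,
         character(1))
}

my_string <- "means_variab_textimput_x2_200.txt"
nth_field(my_string, 3)
# [1] "textimput"
```

Using `fixed = TRUE` treats the separator as a literal string, so the helper would also behave sensibly for separators such as `.` that are special in a regex.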

edited Oct 03 '22 at 19:35

answered Oct 03 '22 at 19:29

akrun

thank you very much! I was trying to do another thing now based on your answer: ```str_replace(my_string, "^([^_]+)_[^_]+_([^_]+).*", "\\1\\2") [1] "NARR200" (the string is: NARR_G2_200_AB)```. I wanted "NARR 200 AB" (skipping G2). Where did I get it wrong? @akrun – Larissa Cury Oct 04 '22 at 19:57
I tried to use \\3 too, like ```^([^_]+)_[^_]+_([^_]+)_([^_]+)_.*", "\\1\\2\\3``` , but now I get the whole string – Larissa Cury Oct 04 '22 at 20:01
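@LarissaCury Try `str_replace(my_string, "^([^_]+)_[^_]+_([^_]+)_([^_]+)", "\\1 \\2 \\3")`, which gives `"NARR 200 AB"`. The trailing `_.*` in your second attempt requires one more `_` after the fourth word; the string has none, so the pattern does not match and the string is returned unchanged. – akrun Oct 04 '22 at 20:06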
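
To make the difference between the two patterns concrete, here is a small base R sketch using `sub` in place of `str_replace` (an assumption made only to avoid the package dependency; for a single match the two should behave alike). The variable names are illustrative:

```r
x <- "NARR_G2_200_AB"

# Pattern ending in "_.*": needs a fourth underscore, which x does not have,
# so nothing matches and sub() returns the input unchanged.
too_strict <- sub("^([^_]+)_[^_]+_([^_]+)_([^_]+)_.*", "\\1 \\2 \\3", x)
too_strict
# [1] "NARR_G2_200_AB"

# Pattern without the trailing "_.*": matches the whole string,
# and the three captured words are joined with spaces.
relaxed <- sub("^([^_]+)_[^_]+_([^_]+)_([^_]+)", "\\1 \\2 \\3", x)
relaxed
# [1] "NARR 200 AB"
```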
thank you VERY much! (once more!). Regex seems to be an amazing tool, I'll work my way through it :) – Larissa Cury Oct 04 '22 at 20:08
Funny thing, I'm using it inside a ```str_glue()``` , ```str_glue("{str_replace(mystring, '^([^_]+)_[^_]+_([^_]+)_([^_]+)', '\\1 \\2 \\3')}")``` . Am I mistyping something? (it works outside glue just fine) – Larissa Cury Oct 04 '22 at 20:14
@LarissaCury I think you need to escape the backreferences i.e. `str_glue("{str_replace(mystring, '^([^_]+)_[^_]+_([^_]+)_([^_]+)', '\\\\1 \\\\2 \\\\3')}") # NARR 200 AB` – akrun Oct 06 '22 at 22:53
@LarissaCury Sorry for the late reply. I did see your comment, but forgot to reply earlier – akrun Oct 07 '22 at 17:33
That was a good question as I didn't notice that earlier – akrun Oct 07 '22 at 17:35
I ended up using paste0 (which I don't like, I personally prefer ```stringr``` 's functions), but then I'll try again with your solution! – Larissa Cury Oct 07 '22 at 17:39
It is just that whenever there is a `\\`, it needs to be escaped again within glue – akrun Oct 07 '22 at 17:40
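
The extra escaping is easier to see without glue. The base R sketch below assumes that glue evaluates the text inside the braces as R code, so the backslashes go through the parser a second time; `parse(text = ...)` is used here only to mimic that second pass, and the variable names are illustrative:

```r
# After the first parse, this string contains the code text ... '\1' ...
code_single <- "sub('^([^_]+)_.*', '\\1', 'NARR_G2')"
# After the first parse, this string contains the code text ... '\\1' ...
code_double <- "sub('^([^_]+)_.*', '\\\\1', 'NARR_G2')"

# Second parse: '\\1' becomes the backreference \1, as intended.
res_double <- eval(parse(text = code_double))
res_double
# [1] "NARR"

# Second parse: '\1' is read as an escape sequence for a control character,
# so the replacement is no longer a backreference.
res_single <- eval(parse(text = code_single))
```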

score 2 · Answer 2 · answered Oct 03 '22 at 19:30

Following up on the question asked in the comments about restricting the length of the extracted word: this can easily be achieved with a quantifier. If, for example, you want to extract only the first 4 letters:

sub("[^_]+_[^_]+_([^_]{4}).*$", "\\1", my_string)
[1] "text"
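
One caveat with an exact quantifier such as `{4}` is that it cannot match a word shorter than four characters. The sketch below shows two possible ways around that, assuming an anchored variant of the pattern and a made-up second input (`short_string`) for illustration:

```r
my_string    <- "means_variab_textimput_x2_200.txt"
short_string <- "means_variab_abc_x2_200.txt"   # hypothetical input with a 3-letter word

# A range quantifier {1,4} takes up to four characters, so short words still match.
up_to_four <- sub("^[^_]+_[^_]+_([^_]{1,4}).*$", "\\1", c(my_string, short_string))
up_to_four
# [1] "text" "abc"

# Alternative without a regex quantifier: split, index, then truncate with substr().
via_substr <- substr(strsplit(my_string, "_", fixed = TRUE)[[1]][3], 1, 4)
via_substr
# [1] "text"
```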
