Regular expression in R - extract only match

Question

My strings look as follows:

crb_gdp_g_100000_16_16_ftv_all.txt
crb_gdp_g_100000_16_20_fweo2_all.txt
crb_gdp_g_100000_4_40_fweo2_galt_1.txt

I only want to extract the part between f and the following underscore (in these three cases "tv", "weo2" and "weo2").

My regular expression is:

regex.f = "_f([[:alnum:]]+)_"

There is no string with more than one part matching the pattern. Why does the following command not work?

sub(regex.f, "\\1", "crb_gdp_g_100000_16_16_ftv_all.txt")

The command only removes "_f" from the string and returns the remaining string.

Are you sure you need `weo` and not `weo2`? What if there is `_fw2eo2_`? — Wiktor Stribiżew, Jul 27 '17 at 12:25
Can there be more than one occurrence? `_ftv_fwe4_`? The answers below will yield different results in these cases. — Wiktor Stribiżew, Jul 27 '17 at 12:29
Your `sub` does not work because you are *replacing*, but your regex does not match the *whole* string. You need to match the whole string to remove it, and only keep what you need using backreferences to the capturing groups inside the string replacement pattern. — Wiktor Stribiżew, Jul 27 '17 at 12:36
But shouldn't I only get the match, since I put it in parentheses? — tho_mi, Jul 27 '17 at 12:38
That is why Benjamin's approach is more natural in this case, when you need to *match*. It is not Python where you can use `re.findall` and it will fetch you only the capturing group value. `sub` advantage is that it keeps the value unmodified if the regex does not match while `regmatches` will just find no match. — Wiktor Stribiżew, Jul 27 '17 at 12:39
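To make the point in the comments above concrete, here is a minimal base R sketch comparing a partial match with a whole-string match. The variable names `partial` and `whole` are illustrative only and do not appear in the question.

```r
x <- "crb_gdp_g_100000_16_16_ftv_all.txt"

# The pattern matches only "_ftv_", so sub() swaps just that piece for "tv"
# and leaves the rest of the string in place.
partial <- sub("_f([[:alnum:]]+)_", "\\1", x)
print(partial)
# [1] "crb_gdp_g_100000_16_16tvall.txt"

# Adding .* on both sides makes the pattern cover the whole string,
# so the replacement "\\1" is all that remains.
whole <- sub(".*_f([[:alnum:]]+)_.*", "\\1", x)
print(whole)
# [1] "tv"
```

The capture group only decides what `\\1` refers to in the replacement; it does not discard the unmatched parts of the input.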

MLEN · Answer 1 · 2017-07-27T13:08:55.637

4

This can easily be achieved with qdapRegex:

df <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
"crb_gdp_g_100000_16_20_fweo2_all.txt", 
"crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

library(qdapRegex)
rm_between(df, "_f", "_", extract=TRUE)

edited Jul 27 '17 at 13:08

answered Jul 27 '17 at 12:42

MLEN


The initial boundary is `_f` – Wiktor Stribiżew Jul 27 '17 at 13:07

akrun · Answer 2 · 2017-07-27T12:29:55.120

3

We can use sub to extract the strings by matching the character f followed by one or more characters that are not an underscore or a digit ([^_0-9]+), captured as a group ((...)), followed by zero or more digits (\\d*), then an _ and any remaining characters. The leading and trailing .* make the pattern consume the whole string, so replacing the match with the backreference (\\1) leaves only the captured group:

sub(".*_f([^_0-9]+)\\d*_.*", "\\1", str1)
#[1] "tv"  "weo" "weo"

data

str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
    "crb_gdp_g_100000_16_20_fweo2_all.xml",
     "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")
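The pattern above drops trailing digits, whereas the question lists "weo2" as the wanted result. A possible variant, assuming the digits should be kept, captures everything up to the next underscore. The names `with_digits` and `no_match` are illustrative only.

```r
str1 <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
          "crb_gdp_g_100000_16_20_fweo2_all.xml",
          "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

# Assumption: trailing digits belong to the wanted part ("weo2", not "weo").
# [^_]+ captures everything after "_f" up to the next underscore.
with_digits <- sub(".*_f([^_]+)_.*", "\\1", str1)
print(with_digits)
# [1] "tv"   "weo2" "weo2"

# When the pattern does not match at all, sub() returns the input unchanged.
no_match <- sub(".*_f([^_]+)_.*", "\\1", "abc.txt")
print(no_match)
# [1] "abc.txt"
```

The second call shows the behaviour mentioned in the comments on the question: a non-matching input passes through `sub` unmodified, so a check for a real match may be needed before relying on the result.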

edited Jul 27 '17 at 12:29

answered Jul 27 '17 at 12:21

akrun


Thanks, works for all cases. But why does my approach not work? Where's my mistake? – tho_mi Jul 27 '17 at 12:33
@tho_mi What your approach does is find, for example, `_ftv_` and replace it with `tv`. @akrun's pattern matches any string that ends in `_f`, followed by `tv`, followed by any string that starts with a `_`, and replaces all of it with `tv`. The difference is that you haven't told `sub` what to do with the leading and trailing characters, so it leaves them intact. – Benjamin Jul 27 '17 at 12:39
I see. I thought using parentheses only returns the stuff between them, discarding the rest. – tho_mi Jul 27 '17 at 12:41

Benjamin · Answer 3 · 2017-07-27T13:18:45.083

3

My usual regex for extracting the text between two characters comes from https://stackoverflow.com/a/13499594/1017276, which specifically looks at extracting text between parentheses. This approach only changes the parentheses to f and _.

x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.xml",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml")

regmatches(x, gregexpr("(?<=_f).*?(?=_)", x, perl = TRUE))

Or with the stringr package.

library(stringr)

str_extract(x, "(?<=_f).*?(?=_)")

Edited to start the match on _f instead of f.

NOTE

akrun's answer runs a few milliseconds faster than the stringr approach, and about ten times faster than the base approach. The base approach clocks in at about 100 milliseconds for a character vector of 10,000 elements.
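One way such a comparison might be reproduced with base R alone is sketched below. This is not the benchmark behind the note above: the input vector, the `[^_]+` variant of the `sub` pattern, and the use of `regexpr` (first match only) instead of `gregexpr` are all assumptions made for illustration, and timings will differ between machines.

```r
# Hypothetical input: the three example strings recycled to 10,000 elements.
x <- rep(c("crb_gdp_g_100000_16_16_ftv_all.txt",
           "crb_gdp_g_100000_16_20_fweo2_all.txt",
           "crb_gdp_g_100000_4_40_fweo2_galt_1.txt"),
         length.out = 10000)

# Replacement approach: match the whole string, keep the capture group.
t_sub <- system.time(
  res_sub <- sub(".*_f([^_]+)_.*", "\\1", x)
)

# Extraction approach: lookarounds with the PCRE engine, first match only.
t_base <- system.time(
  res_base <- regmatches(x, regexpr("(?<=_f).*?(?=_)", x, perl = TRUE))
)

# Elapsed seconds for each approach; values vary from run to run.
print(c(sub = t_sub[["elapsed"]], regmatches = t_base[["elapsed"]]))

# Both approaches should agree on this input, where every string matches.
stopifnot(identical(res_sub, res_base))
```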

edited Jul 27 '17 at 13:18

answered Jul 27 '17 at 12:26

Benjamin


Sorry, my examples were bad (/insufficient) and your code doesn't work for my third example. – tho_mi Jul 27 '17 at 12:31
@tho_mi I think [it works for all cases](https://regex101.com/r/rvpx9g/1). – Wiktor Stribiżew Jul 27 '17 at 12:34
Yes, (now?) it works for all cases, but in some cases it returns an empty string first. – tho_mi Jul 27 '17 at 12:36
@Benjamin, why does the approach of akrun work but mine doesn't? – tho_mi Jul 27 '17 at 12:37
@tho_mi I'd be interested in a case where you are getting the empty string first. My first thought on such a case is that no match was found. – Benjamin Jul 27 '17 at 12:44
In case of the following string I get an empty string first, followed by "lin": crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml – tho_mi Jul 27 '17 at 12:49
@tho_mi [You said there is always a single occurrence.](https://stackoverflow.com/questions/45350677/regular-expression-in-r-extract-only-match#comment77662870_45350677) So, you want to only get the last one? In `crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml`, there is `f_` and `_flin_`. – Wiktor Stribiżew Jul 27 '17 at 12:54
Does that matter? The first case doesn't begin with an underscore? – tho_mi Jul 27 '17 at 13:02
Ok, so you need `"(?<=_f).*?(?=_)"` then – Wiktor Stribiżew Jul 27 '17 at 13:07
Ah, yes. I now use my example, just with ".*" at the beginning and the end. This way everything works fine. Thanks :) – tho_mi Jul 27 '17 at 13:11
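The empty string discussed in these comments can be reproduced with a small base R sketch. It assumes the earlier version of the pattern used a lookbehind on `f` alone, as the edit note in the answer suggests; the names `loose` and `strict` are illustrative only.

```r
s <- "crb_gdp_g_100000_20_tbf_16_nqa_8_flin_galt_2.xml"

# Lookbehind on "f" alone: the "f" that ends "tbf" is followed directly by "_",
# so the lazy .*? matches an empty string there before "lin" is found.
loose <- regmatches(s, gregexpr("(?<=f).*?(?=_)", s, perl = TRUE))[[1]]
print(loose)
# [1] ""    "lin"

# Lookbehind on "_f": only "_flin_" qualifies.
strict <- regmatches(s, gregexpr("(?<=_f).*?(?=_)", s, perl = TRUE))[[1]]
print(strict)
# [1] "lin"
```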

ewwink · Answer 4 · 2017-07-27T13:13:42.157

2

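Update: capture the match using str_match.

library(stringr)
# str_match returns a matrix: column 1 holds the full match, column 2 the first capture group
m <- str_match("crb_gdp_g_100000_16_20_fweo2_all.txt", "_f([[:alnum:]]+)_")
print(m[, 2])
# [1] "weo2"

Your regex does not work with sub because it lacks a leading and trailing .* to match the rest of the string. You can also use \w as a shorthand; note that \w matches the underscore as well as [[:alnum:]], so the quantifier is made lazy (+?) to stop at the first underscore: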

sub(".*_f(\\w+?)_.*", "\\1", "crb_gdp_g_100000_16_20_fweo2_all.txt")

edited Jul 27 '17 at 13:13

answered Jul 27 '17 at 12:48

ewwink


Ah, thanks! That means by default sub tries to match the whole string and not just substrings? – tho_mi Jul 27 '17 at 12:51
I think you have a misunderstanding: `sub` is for replacing strings, and to extract a match you need `regmatches`. – ewwink Jul 27 '17 at 12:58
Ok, I see. Does that also mean that the parentheses don't work in the case of regmatches? Using regmatches with the regular expression from my code I get things like "_fweo2_". Is this conclusion correct? – tho_mi Jul 27 '17 at 13:05
Not correct. Use `str_match` to capture the group; see the updated answer, which uses your regex. – ewwink Jul 27 '17 at 13:15
@tho_mi You can use `regmatches` together with `regexec` to return captured subexpressions, but the process is a bit more cumbersome. Here, `regmatches("crb_gdp_g_100000_16_20_fweo2_all.txt", regexec("_f([[:alnum:]]+)_", "crb_gdp_g_100000_16_20_fweo2_all.txt"))[[c(1, 2)]]` will return the same result as the `str_match` call above. – lmo Jul 27 '17 at 13:31
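The `regexec` approach from the last comment also works on a whole vector. The sketch below is a minimal base R illustration; the fourth input string and the names `m` and `captured` are assumptions added to show the no-match case.

```r
x <- c("crb_gdp_g_100000_16_16_ftv_all.txt",
       "crb_gdp_g_100000_16_20_fweo2_all.txt",
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt",
       "no_match_here.txt")  # hypothetical input without an "_f..._" part

# regexec() records the full match and each capture group; regmatches() then
# returns one character vector per input: c(full match, group 1).
m <- regmatches(x, regexec("_f([[:alnum:]]+)_", x))

# Pick the second element (the capture group) from each result.
# Inputs without a match yield character(0), so indexing gives NA.
captured <- sapply(m, `[`, 2)
print(captured)
# [1] "tv"   "weo2" "weo2" NA
```

Unlike `sub`, this returns `NA` for strings that do not match, which makes missing values easy to detect.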

score 1 · Answer 5 · answered Oct 09 '19 at 12:25

We could use the package unglue:

library(unglue)
txt <- c("crb_gdp_g_100000_16_16_ftv_all.txt", 
       "crb_gdp_g_100000_16_20_fweo2_all.txt", 
       "crb_gdp_g_100000_4_40_fweo2_galt_1.txt")

pattern <-
  "crb_gdp_g_100000_{=\\d+}_{=\\d+}_f{x}_{=.+?}.txt"
unglue_vec(txt,pattern)
#> [1] "tv"   "weo2" "weo2"

Created on 2019-10-09 by the reprex package (v0.3.0)
