13

I have many filenames which look like:

txt <- "MA0051_IRF2.xml"

I want to extract "IRF2", the part between the "_" and the ".". How do I do this in R?

MAPK
Paul.j

4 Answers

31

To achieve this, you need a regexp that

  • matches an (optional) arbitrary string in front of the _ : .*
  • matches a literal _ : [_]
  • matches everything up to (but not including) the next . and stores it in capturing group no. 1 : ([^.]+)
  • matches a literal . : [.]
  • matches an (optional) arbitrary string after the . : .*

In your call to gsub, you then

  • use the regular expression we built in the previous step
  • replace the whole string with the contents of the first capturing group: \\1 (we need to escape the backslash, hence the double backslash)

Example:

gsub(".*[_]([^.]+)[.].*", "\\1", "MA0051_IRF2.xml")
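Since the question mentions many filenames, the same pattern can be applied to a whole character vector at once. The following is a minimal sketch, not part of the original answer: the second and third filenames are hypothetical, and marking non-matching names as NA is an assumption about the desired behaviour. Strings that do not match the pattern are otherwise returned unchanged.

```r
# Hypothetical filenames; only the first one comes from the question
files <- c("MA0051_IRF2.xml", "MA0052_MEF2A.xml", "README")

pattern <- ".*[_]([^.]+)[.].*"

# sub() is enough here because the pattern consumes the whole string once
ids <- sub(pattern, "\\1", files)

# Non-matching strings come back unchanged, so mark them as NA instead
ids[!grepl(pattern, files)] <- NA_character_

ids  # c("IRF2", "MEF2A", NA)
```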
Frank Schmitt
6

Another possibility, using the stringr package:

library(stringr)
str_extract(x, "(?<=_)(.+)(?=\\.)")  # older stringr versions wrapped the pattern in perl()
droopy
    If I am not wrong, current versions of `stringr` no longer use the function `perl`; the syntax for stringr v>1.4 would be `str_extract(x, "(?<=_)(.+)(?=\\.)")`. – Alf Pascu Mar 28 '22 at 15:26
4
gsub(".*_(.*)\\..*", "\\1", txt)
## [1] "IRF2"
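For the filename in the question, this pattern and the `[^.]+` variant give the same result. They can differ once a name contains several underscores or dots, because `.*` is greedy. A small sketch of the difference, using a hypothetical filename that is not from the question:

```r
# Hypothetical filename with two underscores and two dots
x <- "MA0051_IRF2_v2.tar.gz"

# Greedy (.*) runs from the last underscore up to the LAST dot
greedy <- sub(".*_(.*)\\..*", "\\1", x)               # "v2.tar"

# [^.]+ stops at the FIRST dot after the last underscore
to_first_dot <- sub(".*_([^.]+)[.].*", "\\1", x)      # "v2"

# Anchoring with ^[^_]* starts at the FIRST underscore instead
from_first_us <- sub("^[^_]*_([^.]+)[.].*", "\\1", x) # "IRF2_v2"
```

Which variant is appropriate depends on how the real filenames are structured.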
David Arenburg
4

Here's a possible solution that doesn't require regex knowledge:

txt <- "MA0051_IRF2.xml"

library(qdap)
genXtract(txt, "_", ".")

## _  :  . 
##  "IRF2" 
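If installing qdap is not an option, a similar regex-free result can be sketched in base R with fixed-string splitting. This is an alternative technique and not a description of how `genXtract` works internally; the helper name `between_fixed` is hypothetical. It returns a plain, unnamed character vector, which may be convenient when the result has to go into a data frame column.

```r
# Hypothetical helper: text between the first `left` and the next `right`,
# using literal (fixed) delimiters instead of regular expressions
between_fixed <- function(x, left = "_", right = ".") {
  # Take the piece right after the first `left` (NA if `left` is absent)
  after_left <- vapply(strsplit(x, left, fixed = TRUE),
                       function(p) p[2], character(1))
  # Keep the piece before the first `right`
  vapply(strsplit(after_left, right, fixed = TRUE),
         function(p) p[1], character(1))
}

between_fixed("MA0051_IRF2.xml")  # "IRF2"
```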
Tyler Rinker
  • This is cool, but can you make sure the result is extracted as a character vector so that it can be added in mutate? For example `mutate(extract = genXtract(txt, "-", "."))` – LDT Oct 19 '22 at 07:41