
I'm sure this is a silly question. I have a couple of strings such as data_PB_Belf.csv and I need to extract only PB_Belf (and so on). How can I extract everything after the first _ up to the . (preferably using stringr)?

data
[1] "data_PB_Belf.csv" "data_PB_NI.csv" ...

str_replace(data[1], "^[^_]+_([^_]+)_.*", "\\1") ## the closest I got; it returns "PB"
  • I tried to adapt the code from here, but I wasn't able to. I'm sure there's a way to use str_replace(), str_sub() or str_extract(); I just can't get the regex right. Thanks in advance!
Larissa Cury

2 Answers


We can match one or more characters that are not a _ ([^_]+) from the start (^) of the string, followed by a _. We then capture one or more characters that are not a dot (([^.]+)), followed by a literal dot (. is a metacharacter, so it is escaped as \\.) and any remaining characters (.*). Finally, we replace the whole match with the backreference (\\1) to the captured group.

sub("^[^_]+_([^.]+)\\..*", "\\1", data)
[1] "PB_Belf" "PB_NI" 

Or with str_replace

library(stringr)
str_replace(data, "^[^_]+_([^.]+)\\..*", "\\1")
[1] "PB_Belf" "PB_NI" 
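As a possible alternative (a sketch, not part of the original answer), the same capture group can be pulled out directly with str_match(), or the extension can be dropped first with base R's tools::file_path_sans_ext(). The vector data below is assumed to hold only the two file names shown in the question, and the names via_match and via_base are made up for illustration.

```r
library(stringr)

# Assumed input: the two file names shown in the question
data <- c("data_PB_Belf.csv", "data_PB_NI.csv")

# str_match() returns a character matrix: column 1 is the full match,
# column 2 is the first capture group
via_match <- str_match(data, "^[^_]+_([^.]+)\\.")[, 2]

# Base R variant: drop the extension, then remove everything up to the first _
via_base <- sub("^[^_]*_", "", tools::file_path_sans_ext(data))

via_match
via_base
```

Both variants should give the same result as the sub() and str_replace() calls above. The str_match() form avoids having to match and discard the rest of the string with a trailing .*.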
akrun

There are other, simpler options available too. For example, as you mentioned, you can use str_extract in conjunction with lookarounds:

library(stringr)
str_extract(x, "(?<=_).*?(?=\\.)")
[1] "PB_Belf" "PB_NI"

Here we are using:

  • (?<=_): positive lookbehind to assert that what we want to extract must be preceded by _,
  • .*?: a lazy match that takes as few characters as possible, so the match stops at the first dot, and
  • (?=\\.): positive lookahead to assert that what we want to extract must be followed by a dot.
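To see what the lazy quantifier .*? contributes, here is a small sketch on a hypothetical file name with two dots (this name is not from the question and is used only for illustration). With the lazy form the match ends at the first dot; with a greedy .* it runs up to the last dot.

```r
library(stringr)

# Hypothetical file name with two dots, used only for illustration
y <- "data_PB_Belf.tar.gz"

# Lazy .*? stops at the first dot
lazy <- str_extract(y, "(?<=_).*?(?=\\.)")

# Greedy .* runs up to the last dot
greedy <- str_extract(y, "(?<=_).*(?=\\.)")

lazy    # "PB_Belf"
greedy  # "PB_Belf.tar"
```

For file names with a single dot, as in the question, both forms return the same result.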

Data:

x <- c("data_PB_Belf.csv", "data_PB_NI.csv")
Chris Ruehlemann