I have a character vector in R with entries like data = "BURR_WK_94_91", and I want to extract the number that falls between the two underscores, in this case 94. The strings vary in length, so I can't rely on a fixed starting position.

I'm almost there with this:

library(qdap)
genXtract(data, "_", "_")

But that gives me extra data that I don't need. Is there a way to check whether the text between the underscores is a number, and extract it only in that case?

Wiktor Stribiżew
adkane

3 Answers


Yes, for example by using a regex with a lookbehind and a lookahead:

data = "BURR_WK_94_91"
gsub(".*(?<=_)(\\d+)(?=_).*", "\\1", data, perl = TRUE)

[1] "94"

Or, using the stringr package, you only need to match the digits themselves, because the lookarounds check for the underscores without consuming them:

stringr::str_extract_all(data, "(?<=_)\\d+(?=_)")

[[1]]
[1] "94"
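
As a rough base R sketch of the same lookaround idea (not part of the original answer), gregexpr() with regmatches() returns every match per string, much like str_extract_all does. The second and third sample strings are made up for illustration; they show that adjacent numbers sharing an underscore are both found, and that a string without a match yields an empty result.

```r
# Only the first entry comes from the question; the others are hypothetical.
data <- c("BURR_WK_94_91", "X_12_34_Y", "NO_DIGITS_HERE")

# The lookbehind (?<=_) and lookahead (?=_) require an underscore on each
# side of the digits but do not consume it, so "_12_34_" yields both numbers.
pattern <- "(?<=_)\\d+(?=_)"

# gregexpr() locates all matches; regmatches() pulls them out as a list
# with one character vector per input string.
matches <- regmatches(data, gregexpr(pattern, data, perl = TRUE))
matches
```

Each list element can be empty, hold one number, or hold several, so the result may need further handling before it fits back into a data frame column.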
erocoar

One approach would be to use:

gsub(".*_(\\d+)_.*", "\\1", "BURR_WK_94_91", perl = TRUE)

(\\d+) - a capture group that matches one or more digits
\\1 - a back-reference to the first capture group
.*_ - any number of characters, followed by a _
_.* - a _ followed by any number of characters

So basically, you are telling the function to replace the whole string with the contents of the capture group.

If there are exactly two digits:

gsub(".*_(\\d{2})_.*", "\\1", "BURR_WK_94_91", perl = TRUE)
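
One caveat with the replace-everything approach is that gsub() and sub() return the input unchanged when the pattern does not match. A minimal sketch of a guard, assuming you would rather get NA in that case; the helper name extract_between and the second sample string are hypothetical, not from the answer:

```r
# Hypothetical helper: returns the digits between two underscores,
# or NA when the string contains no such number.
extract_between <- function(x) {
  pattern <- ".*_(\\d+)_.*"
  # grepl() tests for a match first, so non-matching strings become NA
  # instead of being passed through untouched by sub().
  ifelse(grepl(pattern, x, perl = TRUE),
         sub(pattern, "\\1", x, perl = TRUE),
         NA_character_)
}

result <- extract_between(c("BURR_WK_94_91", "NO_DIGITS_HERE"))
result
```

sub() is enough here because the pattern already spans the whole string, so there is at most one replacement per element.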
missuse

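You can use str_match from the stringr package. It returns a matrix whose first column is the full match and whose second column is the first capture group:

# Column 2 holds the digits captured between the underscores.
stringr::str_match(data, "_([0-9]{2})_")[, 2]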
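
If stringr is not available, a similar capture-group extraction can be sketched in base R with regexec() and regmatches(). This is an alternative to the answer's approach, not the same function. It uses [0-9]+ rather than {2} so that numbers of any length match, and the second sample string is invented to show the no-match case.

```r
# Only the first entry comes from the question; the second is hypothetical.
data <- c("BURR_WK_94_91", "NO_DIGITS_HERE")

# regexec() records the full match and each capture group; regmatches()
# returns c(full match, group 1) per string, or character(0) if none.
m <- regmatches(data, regexec("_([0-9]+)_", data))

# Take the capture group and convert it to an integer, using NA when
# the string had no match.
nums <- vapply(
  m,
  function(x) if (length(x) >= 2) as.integer(x[2]) else NA_integer_,
  integer(1)
)
nums
```

Converting to integer at this point is a design choice that suits later arithmetic; keep the character values instead if leading zeros matter.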
Jagge