
I have a vector with the following elements:

myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

I want to selectively extract the value after chr and before .recalibrated and get the result.

Result:

10, 11, Y
BenBarnes
MAPK

4 Answers


You can do that with a mere sub:

> sub(".*?chr(.*?)\\.recalibrated.*", "\\1", myvec)
[1] "10" "11" "Y" 

The pattern matches any characters before the first chr, then matches and captures any characters up to the first .recalibrated, and then matches the rest of the string. In the replacement pattern, the backreference \1 inserts the captured value back into the resulting string.
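One caveat that may be worth checking: when the pattern does not match at all, sub returns the input string unchanged rather than NA. Below is a minimal sketch of a guard with grepl, assuming a hypothetical input whose last element ("README.txt") has no chr part; the names x, pattern and ids are only for illustration.

```r
# Hypothetical input: the last element has no "chr...recalibrated" part.
x <- c("output.chr10.recalibrated", "output.chrY.recalibrated", "README.txt")

pattern <- ".*?chr(.*?)\\.recalibrated.*"

# sub() leaves non-matching strings untouched, so test with grepl() first
# and return NA for elements where the pattern does not match.
ids <- ifelse(grepl(pattern, x), sub(pattern, "\\1", x), NA_character_)
print(ids)
```

Without the guard, the last element would come back as "README.txt", which is easy to mistake for a valid result.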


As an alternative, use str_match:

> library(stringr)
> str_match(myvec, "chr(.*?)\\.recalibrated")[,2]
[1] "10" "11" "Y" 

It keeps all captured values, which avoids the costly unanchored lookarounds that str_extract needs in the pattern.

The pattern means:

  • chr - match a sequence of literal characters chr
  • (.*?) - match any characters other than a newline (if you need to match newlines, too, add (?s) at the beginning of the pattern) up to the first
  • \\.recalibrated - .recalibrated literal character sequence.
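To illustrate why "up to the first" matters, here is a small sketch comparing the lazy (.*?) with a greedy (.*). The input string is hypothetical (it contains .recalibrated twice), and perl = TRUE is an assumption made here so that PCRE backtracking semantics apply.

```r
# Hypothetical input containing ".recalibrated" twice.
x <- "a.chr10.recalibrated.chr11.recalibrated"

# Lazy (.*?): the capture stops at the first ".recalibrated".
lazy <- sub(".*?chr(.*?)\\.recalibrated.*", "\\1", x, perl = TRUE)

# Greedy (.*): the capture runs up to the last ".recalibrated".
greedy <- sub(".*?chr(.*)\\.recalibrated.*", "\\1", x, perl = TRUE)

print(lazy)    # "10"
print(greedy)  # "10.recalibrated.chr11"
```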
Wiktor Stribiżew
  • Just for my own sanity: `[.]chr([^.]*)[.]` should be enough as a regex in this specific case (the run of non-dot characters just after chr, enclosed between two dots). – Tensibai Apr 21 '16 at 12:01
  • Side note: upvoted for the completeness of the answer, which clearly explains how it works. It may be worth adding a note about `*?` for the non-greedy match to make it even better, IMO. – Tensibai Apr 21 '16 at 12:14
  • Sorry, was very busy. `*?` is a [**lazy quantifier**](http://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions), right, I just wanted to explain the pattern in human words (*... up to the **first** `.recalibrated`*). I would not use `([^.]*)` because it won't match if there is a dot between `chr` and `.recalibrated`. **If** that cannot occur, then yes, I would. – Wiktor Stribiżew Apr 21 '16 at 12:29
  • Fair enough ;) Now the OP has both sides of how his/her question can be interpreted, and that's already far too much attention IMO. – Tensibai Apr 21 '16 at 12:39

Both answers fail on slightly different inputs such as whatever.chr10.whateverelse.recalibrated. Here is my own approach, which differs only in the regex passed to sub:

sub(".*[.]chr([^.]*)[.].*", "\\1", myvec)

What the regex does:

  • .*[.]chr matches as much as possible up to a literal '.chr'
  • ([^.]*) captures everything that is not a dot after chr (this could be replaced by \\d+ to capture only numeric values, which requires at least one digit and so would not match chrY)
  • [.].* matches a literal dot and then the rest of the line

I prefer escaping dots with a character class ([.]) over the backslash escape (\\.) because it's usually easier to read when you come back to the regex. That's my opinion and is not covered by any best practice I know of.
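As a quick check of the difference, this sketch runs a pattern anchored on .recalibrated and the dot-delimited pattern on the example input above. The variable names are only illustrative, and perl = TRUE on the first call is an assumption made here to pin down the lazy-quantifier behaviour.

```r
x <- "whatever.chr10.whateverelse.recalibrated"

# Anchored on ".recalibrated": the capture also swallows the extra field.
by_suffix <- sub(".*?chr(.*?)\\.recalibrated.*", "\\1", x, perl = TRUE)

# Anchored on the dots around the chr field: only the chr value is captured.
by_dots <- sub(".*[.]chr([^.]*)[.].*", "\\1", x)

print(by_suffix)  # "10.whateverelse"
print(by_dots)    # "10"
```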

Tensibai

We can use str_extract to do this. We match zero or more characters (.*) that follow 'chr' ((?<=chr)) and come before .recalibrated ((?=\\.recalibrated)).

 library(stringr)
 str_extract(myvec, "(?<=chr).*(?=\\.recalibrated)")
 #[1] "10" "11" "Y" 

Or use gsub to match either everything up to and including .chr, or (|) everything from .recalibrated to the end ($) of the string, and replace both with ''.

 gsub(".*\\.chr|\\.recalibrated.*$", "", myvec)
 #[1] "10" "11" "Y" 
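
If stringr is not available, the same lookaround pattern can probably be used in base R. This is a sketch using regexpr and regmatches; perl = TRUE is needed because the default regex engine does not support lookarounds. The names m and ids are only for illustration.

```r
myvec <- c("output.chr10.recalibrated", "output.chr11.recalibrated",
           "output.chrY.recalibrated")

# Locate the first match of the lookaround pattern in each element.
m <- regexpr("(?<=chr).*(?=\\.recalibrated)", myvec, perl = TRUE)

# Extract the matched substrings; elements without a match are dropped.
ids <- regmatches(myvec, m)
print(ids)
```

Note that regmatches silently drops non-matching elements, so the result can be shorter than the input.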
akrun
  • Thank you so much. What is the difference between `?<=chr` and `?=\\`? – MAPK Apr 21 '16 at 10:28
  • @MAPK We are using regex lookarounds to select the characters between `chr` and `.recalibrated` – akrun Apr 21 '16 at 10:28
  • @MAPK: Lookarounds (zero-width assertions that only check whether some text can or cannot be matched before or after the current location in the string) are necessary with `str_extract` because this function does not keep captured values. – Wiktor Stribiżew Apr 21 '16 at 10:32

This looks like an XY problem. Why extract at all? If the value is needed in further analysis steps, we could, for example, do this instead:

for(chrN in c(1:22, "X", "Y")) {
  myVar <- paste0("output.chr", chrN, ".recalibrated")
  #do some fun stuff with myVar 
  print(myVar)
}
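
The same idea can also be written without a loop. This is a sketch that builds all names at once and keeps the chromosome label as the element name, so nothing has to be extracted later. The names chr_ids and files are hypothetical.

```r
# Chromosome labels; mixing numbers and letters coerces all to character.
chr_ids <- c(1:22, "X", "Y")

# Build every name in one vectorised call, named by its chromosome label.
files <- setNames(paste0("output.chr", chr_ids, ".recalibrated"), chr_ids)

# Look up a name by label, or recover the labels with names(files).
print(files[["Y"]])
```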
zx8754