5

I have variable names in the form:

PP_Sample_12.GT

or

PP_Sample-17.GT

I'm trying to use string splitting to pull out the middle section, i.e. Sample_12 or Sample-17. However, when I do:

IDtmp <- sapply(strsplit(names(df[c(1:13)]),'_'),function(x) x[2])
IDs <- data.frame(sapply(strsplit(IDtmp,'.GT',fixed=T),function(x) x[1]))

I end up with just Sample for PP_Sample_12.GT, rather than Sample_12.

Is there another way to do this? Maybe with a pattern/replace kind of function? I'm not sure whether that exists in R, but I think gsub might work.

MAPK
user2726449
  • The reason you are not finding the error is that you have too many layers of abstraction. Instead of trying to do everything at once, work on the goal of turning `PP_Sample-17.GT` into what you want, **then** generalize. – Señor O May 06 '14 at 19:39

4 Answers

6

Using this input:

x <- c("PP_Sample_12.GT", "PP_Sample-17.GT")

1) strsplit. Replace the first underscore with a dot and then split on dots:

spl <- strsplit(sub("_", ".", x), ".", fixed = TRUE)
sapply(spl, "[", 2)

2) gsub. Replace the prefix (^[^_]*_) and the suffix (\\.[^.]*$) with the empty string:

gsub("^[^_]*_|\\.[^.]*$", "", x)

3) gsubfn::strapplyc. Extract everything between the first underscore and the last dot:

library(gsubfn)
strapplyc(x, "_(.*)\\.", simplify = TRUE)
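
As a quick sanity check, the two base-R approaches above can be run side by side on the question's inputs. This is only a sketch: the names via_split, via_gsub and via_sub are illustrative, and the third variant (a single sub with a capture group) is an assumption added here for comparison, not part of the answer.

```r
x <- c("PP_Sample_12.GT", "PP_Sample-17.GT")

# 1) turn the first underscore into a dot, split on dots, keep the 2nd piece
via_split <- sapply(strsplit(sub("_", ".", x), ".", fixed = TRUE), "[", 2)

# 2) delete the prefix up to the first underscore and the suffix from the last dot
via_gsub <- gsub("^[^_]*_|\\.[^.]*$", "", x)

# illustrative alternative: capture the middle part and keep only the capture
via_sub <- sub("^[^_]*_(.*)\\.[^.]*$", "\\1", x)

# all three should agree
stopifnot(identical(via_split, via_gsub), identical(via_gsub, via_sub))
print(via_gsub)
# [1] "Sample_12" "Sample-17"
```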
G. Grothendieck
5

Here's a gsub that will extract everything after the first _ and before the last . (the leading .*? is lazy, so it stops at the first underscore):

x<-c("PP_Sample-12.GT","PP_Sample-17.GT")
gsub(".*?_(.*)\\..*","\\1", x, perl=TRUE)
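
A leading .* is greedy, so on a name with two underscores it runs up to the last underscore, while a lazy .*? stops at the first. A minimal sketch comparing the two on the question's inputs (the names greedy and lazy are illustrative):

```r
x <- c("PP_Sample_12.GT", "PP_Sample-17.GT")

# greedy prefix: consumes up to the LAST underscore
greedy <- gsub(".*_(.*)\\..*", "\\1", x, perl = TRUE)

# lazy prefix: consumes up to the FIRST underscore
lazy <- gsub(".*?_(.*)\\..*", "\\1", x, perl = TRUE)

print(greedy)
# [1] "12"        "Sample-17"
print(lazy)
# [1] "Sample_12" "Sample-17"
```

The two patterns only differ on names that contain more than one underscore, which is why an input such as PP_Sample-12.GT does not reveal the difference.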
Thomas
MrFlick
1

This splits each string on an underscore that is followed by a non-digit, or on .GT. sapply then calls the subsetting function [ on each element of the resulting list to grab its 2nd piece, and simplifies the result into a vector.

x <- c('PP_Sample_12.GT', 'PP_Sample-17.GT')
sapply(strsplit(x, '(?:_(?=\\D)|\\.GT)', perl = T), '[', 2)

[1] "Sample_12" "Sample-17"
hwnd
0

If they all start with PP_ and end with .GT, the gsub expression is simple: anchor the prefix and the suffix, and replace both with the empty string.

> x <- c("PP_Sample-12.GT","PP_Sample-17.GT")
> gsub('^PP_|\\.GT$','',x)
[1] "Sample-12" "Sample-17"
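
Square brackets in a regular expression define a character class, so a pattern like [(PP_)|(.GT)] removes each listed character wherever it occurs rather than the literal strings PP_ and .GT. A sketch of the difference on the question's inputs (the names char_class and anchored are illustrative):

```r
x <- c("PP_Sample_12.GT", "PP_Sample-17.GT")

# character class: deletes every (, P, _, ), |, ., G and T, including the inner underscore
char_class <- gsub("[(PP_)|(.GT)]", "", x)

# anchored alternation: deletes only a leading "PP_" and a trailing ".GT"
anchored <- gsub("^PP_|\\.GT$", "", x)

print(char_class)
# [1] "Sample12"  "Sample-17"
print(anchored)
# [1] "Sample_12" "Sample-17"
```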
Thomas