I'd like to analyze gene expression in TCGA (The Cancer Genome Atlas) data downloaded from Broad Firehose. I have downloaded colorectal cancer (COAD) gene expression data. The data is a data frame (named data) of 20531 observations of 329 variables. Each row is an individual gene, and each column holds the normalized reads for one sample. The header, or column names, are TCGA sample codes. Sample codes look like this:

TCGA-3L-AA1B-01A-11R-A37K-07

and when I upload the table into R it turns into

TCGA.3L.AA1B.01A.11R.A37K.07

What I want to do is pull out the samples where the fourth field of the sample code starts with 01. For example, in the column header above the fourth field is 01A, and the important part is the 01, which indicates a primary tumor. How can I pull out all columns where the fourth field in the column header starts with 01?

Thanks!

head(data)

  Hybridization.REF TCGA.3L.AA1B.01A.11R.A37K.07 TCGA.4N.A93T.01A.11R.A37K.07
1       ?|100130426                       0.5174                            0
2       ?|100133144                      18.0851                       4.4315
3       ?|100134869                       15.764                       4.2767
  TCGA.4T.AA8H.01A.11R.A41B.07 TCGA.5M.AAT4.01A.11R.A41B.07 TCGA.5M.AAT5.01A.21R.A41B.07
1                            0                            0                            0
2                       9.8995                       7.9174                      12.2565
3                      11.3032                      18.7608                      20.8826
  TCGA.5M.AAT6.01A.11R.A41B.07 TCGA.5M.AATA.01A.31R.A41B.07 TCGA.5M.AATE.01A.11R.A41B.07
1                            0                            0                            0
2                       3.9637                       7.2366                       11.629
3                      15.0672                      11.4513                        6.906
busybear
  • [This question](https://stackoverflow.com/questions/25923392/select-columns-based-on-string-match-dplyrselect) looks like what you're after. The question specifically mentions `dplyr`, but there's at least one answer that just uses base R, if that's what you need. – A. S. K. Jan 29 '19 at 04:24

1 Answer

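The general approach is to split your column names at the . with strsplit and check whether the 4th component starts with '01'. The full component is something like '01A', so take its first two characters with substr before comparing. Since strsplit treats the split pattern as a regular expression, where . is a special character, you'll have to escape it as \\.. Using %in% instead of == means names with fewer than four components (such as Hybridization.REF) count as non-matches rather than producing NA. Here's what it might look like:

data[substr(sapply(strsplit(colnames(data), '\\.'), '[', 4), 1, 2) %in% '01']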
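As an alternative, the same selection could be done by matching a regular expression directly against the column names with grepl. The sketch below is only an illustration: the toy data frame, its values, the second sample column and the names is_primary and primary are all made up for the example, and it assumes the gene identifier column is called Hybridization.REF as in the question's output.

```r
# Toy stand-in for `data`; values and the second sample column are hypothetical.
data <- data.frame(
  Hybridization.REF = c("?|100130426", "?|100133144"),
  TCGA.3L.AA1B.01A.11R.A37K.07 = c(0.5174, 18.0851),
  TCGA.AA.0000.11A.11R.A37K.07 = c(1.0, 2.0),  # made-up sample whose 4th field is not 01
  stringsAsFactors = FALSE
)

# TRUE for names of the form TCGA.<x>.<y>.01..., i.e. 4th dot-separated field starts with "01".
is_primary <- grepl("^TCGA\\.[^.]+\\.[^.]+\\.01", colnames(data))

# Keep the gene identifier column together with the matching sample columns.
primary <- data[, is_primary | colnames(data) == "Hybridization.REF", drop = FALSE]
colnames(primary)
```

Keeping Hybridization.REF is a design choice so that each row can still be traced back to its gene. Drop that condition if only the sample columns are needed.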
busybear