I'd like to analyze gene expression in TCGA (The Cancer Genome Atlas) data downloaded from Broad Firehose. I have downloaded the colorectal cancer (COAD) gene expression data and loaded it into R as a data frame (named data) with 20531 observations of 329 variables. Each row is a gene, and each column holds the normalized reads for one sample. The column names are TCGA sample codes, which look like this:
TCGA-3L-AA1B-01A-11R-A37K-07
and when I read the table into R, the hyphens are replaced with periods, so it turns into
TCGA.3L.AA1B.01A.11R.A37K.07
What I want to do is pull out the samples where the fourth field of the sample code is 01. For example, in the column header above, the fourth field is 01A, and the important part is the 01, which indicates a primary tumor. How can I pull out all columns where the fourth field in the column header starts with 01?
Thanks!
head(data)
Hybridization.REF TCGA.3L.AA1B.01A.11R.A37K.07 TCGA.4N.A93T.01A.11R.A37K.07
1 ?|100130426 0.5174 0
2 ?|100133144 18.0851 4.4315
3 ?|100134869 15.764 4.2767
TCGA.4T.AA8H.01A.11R.A41B.07 TCGA.5M.AAT4.01A.11R.A41B.07 TCGA.5M.AAT5.01A.21R.A41B.07
1 0 0 0
2 9.8995 7.9174 12.2565
3 11.3032 18.7608 20.8826
TCGA.5M.AAT6.01A.11R.A41B.07 TCGA.5M.AATA.01A.31R.A41B.07 TCGA.5M.AATE.01A.11R.A41B.07
1 0 0 0
2 3.9637 7.2366 11.629
3 15.0672 11.4513 6.906
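
For reference, here is a minimal base R sketch of the kind of selection I am after, not a definitive solution. It uses a small toy data frame in place of the real table: the first two sample columns are copied from the output above, while the third column (TCGA.XX.XXXX.11A.11R.A37K.07) and its values are made up purely to have a non-01 sample to exclude. It assumes the column names are always period-separated after import.

```r
# Toy stand-in for the real table. The "11A" column is hypothetical
# (not from my data) and exists only so there is something to filter out.
data <- data.frame(
  Hybridization.REF            = c("?|100130426", "?|100133144", "?|100134869"),
  TCGA.3L.AA1B.01A.11R.A37K.07 = c(0.5174, 18.0851, 15.764),
  TCGA.4N.A93T.01A.11R.A37K.07 = c(0, 4.4315, 4.2767),
  TCGA.XX.XXXX.11A.11R.A37K.07 = c(1, 2, 3)
)

# Split each column name on a literal "." and take the fourth field.
# Names with fewer than four fields (e.g. Hybridization.REF) give NA.
fourth <- sapply(strsplit(colnames(data), ".", fixed = TRUE),
                 function(parts) parts[4])

# TRUE where the fourth field starts with "01" (01A, 01B, ...).
is_primary <- !is.na(fourth) & substr(fourth, 1, 2) == "01"

# Only the primary tumor columns.
primary <- data[, is_primary, drop = FALSE]

# The same columns, plus the gene identifier column.
keep <- is_primary | colnames(data) == "Hybridization.REF"
primary_with_id <- data[, keep, drop = FALSE]
```

Splitting on the period with fixed = TRUE avoids treating "." as a regular expression wildcard, and drop = FALSE keeps the result a data frame even if only one column matches.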