How to make sublist/extract expression data of candidate genes from normalized microarray list

Question

I have several processed microarray data sets (normalized, .txt files) from which I want to extract the data for a list of 300 candidate genes (ILMN_IDs). The output should contain not only the gene names but also the expression values and statistics (already present in the original file). I have 2 data frames:

normalizedData with the identifiers (gene names) in the first column, named "Name".
candidateGenes with a single column named "Name", containing the identifiers.

I've tried

1).

all=normalizedData  
subset=candidateGenes  
x=all%in%subset

2).

all[which(all$gene_id %in% subset)] #(as suggested in other bioinf. forum)#,

but it returns a data frame with 0 columns and >4000 rows. This is not correct, since normalizedData has 24 columns. I have tried other ways to compare the two, but I always get an error.

The key is to be able to compare the first column of all ("Name") with subset. Here is the info:

> class(all)
[1] "data.frame"
> dim(all)
[1] 4312 24
> str(all)
'data.frame': 4312 obs. of 24 variables:
$ Name: Factor w/ 4312 levels "ILMN_1651253": 3401.. 
$ meanbgt:num 0 .. 
$ meanbgc: num .. 
$ cvt: num 0.11 .. 
$ cvc: num 0.23 ..
$ meant: num 4618 ..
$ stderrt: num 314.6 ..
$ meanc: num 113.8 ... 
$ stderrc: num 15.6 ...
$ ratio: num 40.6 ...     
$ ratiose: num 6.21 ...
$ logratio: num 5.34 ... 
$ tp: num 1.3e-04 ... 
$ t2p: num 0.00476 ... 
$ wilcoxonp: num 0.0809 ...
$ tq: num 0.0256 ...
$ t2q: num 0.165 ...
$ wilcoxonq: num 0.346 ...
$ limmap: num 4.03e-10 ... 
$ limmapa: num 4.34e-06 ... 
$ SYMBOL: Factor w/ 3696 levels "","A2LD1",..
$ ENSEMBL: Factor w/ 3143 levels "ENSG00000000003",..

and here is the info about subset:

> class(subset)
[1] "data.frame"
> dim(subset)
[1] 328 1
> str(subset)
'data.frame': 328 obs. of 1 variable:
$ V1: Factor w/ 328 levels "ILMN_1651429",..: 177 286 47 169 123 109 268 284 234 186 ...

I really appreciate your help!

I guess 'all' is a (two-dimensional) data frame, so should be subset with two subscripts, `all[which(all$gene_id %in% subset),]`, where the 'empty' information after the comma indicates that you want to select all columns. — Martin Morgan, May 29 '14 at 18:37
Thanks for the suggestion @MartinMorgan, but now, in turn, I get 0 rows and 24 columns. What I want, is to find first the genes enlisted in candidateGenes (subset) in the normalizedData (over 4000 gene_ids in rows). Then, I want to extract the info related to each row. It should be a simple procedure, but I can't get it run. — pmedi, May 29 '14 at 18:47
I think you need to be more explicit about the structure of `all`, e.g., reporting the result of `class(all)` and `dim(all)` and perhaps `str(all)`. — Martin Morgan, May 29 '14 at 19:40
Or maybe `dput(head(all))` and `dput(head(subset))`. It's impossible from your description to have any idea of what data types these variables might be. It always helps to give a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with a small test data set so we test possible solutions. — MrFlick, May 29 '14 at 20:35
@pmedi You should edit the original question when people ask for more details rather than posting them in the comments. In the question they can be formatted to be actually readable. — MrFlick, May 29 '14 at 21:54
@MrFlick sorry for the x-fold comments... I am new here, and didn't know the rules. I've already edited my question. — pmedi, May 29 '14 at 22:28

score 0 · Accepted Answer · answered May 29 '14 at 23:02

What you need to do is

all[all$Name %in% subset$V1, ]

When using a data.frame, it's important to drill down to the correct column that has the data you actually want to use. You need to know which columns hold the matching IDs. That's the only way this solution really differs from the other suggestions and the things you've tried.

It's also important to note that when subsetting a data.frame by rows, you need to use the [,] syntax where the vector before the comma indicates rows and the vector after indicates columns. Here, since you want all columns, we leave it empty.
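
As a minimal sketch of both points, the toy data below mimics the structure reported in the question: an `all` data frame with a `Name` column and a `subset` data frame with a single `V1` column. The IDs and values are invented for illustration, and only three columns are shown instead of 24.

```r
# Toy stand-ins for the data in the question; IDs and values are invented.
# Note: naming objects `all` and `subset` mirrors the question, although
# these names also belong to base R functions.
all <- data.frame(
  Name   = factor(c("ILMN_1", "ILMN_2", "ILMN_3", "ILMN_4")),
  meant  = c(4618, 120, 95, 3300),
  SYMBOL = factor(c("A", "B", "C", "D"))
)
subset <- data.frame(V1 = factor(c("ILMN_4", "ILMN_2")))

# `all` has no column called gene_id, so all$gene_id is NULL and the
# comparison has length zero: nothing can be selected with it.
length(all$gene_id %in% subset)

# Compare the ID column of one data frame with the ID column of the other.
keep <- all$Name %in% subset$V1

# Rows go before the comma, columns after; leaving the column slot empty
# keeps every column.
hits <- all[keep, ]
dim(hits)
```

With a zero-length index, `all[idx]` selects zero columns and `all[idx, ]` selects zero rows, which is consistent with the two results described in the question. The rows in `hits` keep the order they have in `all`, not the order of the candidate list.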

MrFlick


It worked! Thanks!!! I get the subset that I wanted. There is just a data shift: the first column contains as name meanbgt and as data all IDs and so on... the last column has no names, but the last data. But I can live with it. Thanks again! – pmedi May 30 '14 at 11:29
I fixed it. It was just a problem in saving the file. – pmedi May 30 '14 at 11:45
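
The comments do not say what the saving problem was. One common cause of a header that looks shifted by one column is `write.table()` with its default `row.names = TRUE`, which writes the row names as an extra leading column without a header entry. A minimal sketch, assuming that was the cause here (the data frame `hits` and the output path are hypothetical):

```r
# Hypothetical result of the row filter; values are invented.
hits <- data.frame(
  Name    = c("ILMN_2", "ILMN_4"),
  meanbgt = c(0, 0.5),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".txt")  # hypothetical output path

# row.names = FALSE keeps the header aligned with the data columns.
write.table(hits, file = out_file, sep = "\t", quote = FALSE,
            row.names = FALSE)

# Reading the file back gives the same column names in the same order.
check <- read.table(out_file, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
names(check)
```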
