Get the most expressed genes from one .CEL file in R

Question

In R the Limma package can give you a list of differentially expressed genes.

How can I simply get all the probesets with highest signal intensity in the respect of a threshold?
Can I get only the most expressed genes in an healty experiment, for example from one .CEL file? Or the most expressed genes from a set of .CEL files of the same group (all of the control group, or all of the sample group).

If you run the following script, it's all ok. You have many .CEL files and all work.

source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GEOquery","affy","limma","gcrma"))
gse_number <- "GSE13887"
getGEOSuppFiles( gse_number )
COMPRESSED_CELS_DIRECTORY <- gse_number
untar( paste( gse_number , paste( gse_number , "RAW.tar" , sep="_") , sep="/" ), exdir=COMPRESSED_CELS_DIRECTORY)
cels <- list.files( COMPRESSED_CELS_DIRECTORY , pattern = "[gz]")
sapply( paste( COMPRESSED_CELS_DIRECTORY , cels, sep="/") , gunzip )
celData <- ReadAffy( celfile.path = gse_number )
gcrma.ExpressionSet <- gcrma(celData)

But if you delete all .CEL files manually but you leave only one, execute the script from scratch, in order to have 1 sample in the celData object:

> celData
AffyBatch object
size of arrays=1164x1164 features (17 kb)
cdf=HG-U133_Plus_2 (54675 affyids)
number of samples=1
number of genes=54675
annotation=hgu133plus2
notes=

Then you'll get the error:

Error in model.frame.default(formula = y ~ x, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'x')

How can I get the most expressed genes from 1 .CEL sample file?

I've found a library that could be useful for my purpose: the panp package.

But, if you run the following script:

if(!require(panp)) { biocLite("panp") }
library(panp)
myGDS <- getGEO("GDS2697")
eset <- GDS2eSet(myGDS,do.log2=TRUE)
my_pa <- pa.calls(eset)

you'll get an error:

> my_pa <- pa.calls(eset)
Error in if (chip == "hgu133b") { : the argument has length zero

even if the platform of the GDS is that expected by the library.

If you run with the pa.call() with gcrma.ExpressionSet as parameter then all work:

my_pa <- pa.calls(gcrma.ExpressionSet)
Processing 28 chips: ############################
Processing complete.

In summary, If you run the script you'll get an error while executing:

my_pa <- pa.calls(eset)

and not while executing

my_pa <- pa.calls(gcrma.ExpressionSet)

Why if they are both ExpressionSet?

> is(gcrma.ExpressionSet)
[1] "ExpressionSet"    "eSet"             "VersionedBiobase" "Versioned"       
> is(eset)
[1] "ExpressionSet"    "eSet"             "VersionedBiobase" "Versioned"

Please include a minimal [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Show what code you've tried so far and make it clear what you expect the desired output to be. — MrFlick, Sep 08 '14 at 12:43

score 2 · Answer 1 · answered Sep 08 '14 at 14:42

Your gcrma.ExpressionSet is an object of class "ExpressionSet"; working with ExpressionSet objects is described in the Biobase vignette

vignette("ExpressionSetIntroduction")

also available on the Biobase landing page. In particular the matrix of summarized expression values can be extracted with exprs(gcrma.ExpressionSet). So

> eset = gcrma.ExpressionSet  ## easier to display
> which(exprs(eset) == max(exprs(eset)), arr.ind=TRUE)
              row col
213477_x_at 22779  24
> sampleNames(eset)[24]
[1] "GSM349767.CEL"

Use justGCRMA() rather than ReadAffy as a faster and more memory efficient way to get to an ExpressionSet.

Consider asking questions about Biocondcutor packages on the Bioconductor support site where you'll get fast responses from knowledgeable members.

Get the most expressed genes from one .CEL file in R

1 Answers1