7

I want to download gene expression data generated by microarray experiments. I do not know much about this subject, but as I understand it, rows usually correspond to genes and columns correspond to samples. Ideally, I expect a matrix of gene expression values.

I've been searching on the internet, and although there seem to be many places to download such data, when I actually download it, I do not get a gene expression matrix. Could someone please tell me where or how to download gene expression data in the format I describe above?

Any help is appreciated.

hello_there_andy
Jane Wayne
  • This question is not related to programming. Please ask it on BioStar http://biostar.stackexchange.com/ – gotgenes Mar 23 '12 at 15:43
  • @gotgenes thanks! i did really try to see if there were other stackexchange channels before posting here. but now i know for sure! luckily i got great responses and the appropriate site now. – Jane Wayne Mar 29 '12 at 23:02

2 Answers

5

If you look at e.g. this entry in the Gene Expression Omnibus, one of the file formats is "TXT" and contains a matrix like you are asking for, after some metadata.
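To make "a matrix ... after some metadata" concrete, here is a minimal Python sketch of pulling the matrix out of such a TXT file. It assumes the usual GEO series matrix layout: metadata lines start with `!`, and the tab-separated table sits between `!series_matrix_table_begin` and `!series_matrix_table_end`. The function name `parse_series_matrix` and the numeric values in `SAMPLE` are made up for illustration; only the GSM, GPL and probe identifiers come from this thread.

```python
import csv
import io


def parse_series_matrix(text):
    """Return (sample_ids, {probe_id: [values]}) from series-matrix text.

    Assumes the table is delimited by the !series_matrix_table_begin and
    !series_matrix_table_end marker lines and is tab-separated.
    """
    in_table = False
    table_lines = []
    for line in text.splitlines():
        if line.startswith("!series_matrix_table_begin"):
            in_table = True
            continue
        if line.startswith("!series_matrix_table_end"):
            break
        if in_table and line.strip():
            table_lines.append(line)

    reader = csv.reader(io.StringIO("\n".join(table_lines)), delimiter="\t")
    header = next(reader)          # first cell is the row-id column name
    sample_ids = header[1:]        # remaining cells are GSM sample accessions
    matrix = {}
    for row in reader:
        # Empty or "null" cells are treated as missing values.
        matrix[row[0]] = [
            float(v) if v not in ("", "null") else None for v in row[1:]
        ]
    return sample_ids, matrix


# Tiny stand-in for a downloaded file; the expression values are invented.
SAMPLE = (
    '!Series_platform_id\t"GPL7144"\n'
    "!series_matrix_table_begin\n"
    '"ID_REF"\t"GSM339455"\t"GSM339456"\n'
    '"998_at"\t7.1\t6.9\n'
    '"9890_at"\t5.2\t\n'
    "!series_matrix_table_end\n"
)

sample_ids, matrix = parse_series_matrix(SAMPLE)
print(sample_ids)
print(matrix)
```

For a real file you would read the (usually gzipped) download into a string first, for example with `gzip.open(path, "rt")`, and pass that to the function.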

Jouni K. Seppänen
  • for that TXT file, are the columns (i.e. GSM339455, GSM339456, GSM339457, etc...) genes and the rows samples? – Jane Wayne Mar 23 '12 at 05:48
  • i'm looking at the cluster analysis. it seems GSM's are samples and the rows do correspond to genes. Could you please explain the naming conventions? i.e. Why use GSM for column headers and then 998_at or 9890_at for row identifiers? – Jane Wayne Mar 23 '12 at 06:04
  • The GSM numbers are accession ids for samples (you can find each sample in the GEO with the id). The "series platform id" listed in the file is GPL7144, and if you query GEO with that id, you get a mapping from the row identifiers to various other ways of referring to genes. – Jouni K. Seppänen Mar 23 '12 at 06:57
  • do you know if you are able to query by dimensions? i.e. i am only interested in data sets that have over 20,000 genes and 1,000 samples? – Jane Wayne Mar 23 '12 at 07:12
  • You can set search limits at http://www.ncbi.nlm.nih.gov/gds/limits and one of the options is sample number, just set that to something like 1000:1000000. There's no field for number of genes, but modern microarray gene expression studies will have over 20k genes, and you can limit by platform to platforms that have sufficient coverage. – Jouni K. Seppänen Mar 23 '12 at 14:05
5

In principle, microarray data can be expressed (please pardon the pun) as a matrix with genes as rows and samples as columns. In practice, it is a good bit more complicated to derive such a representation from the raw data of an experiment. If you just take a pre-processed dataset, you have little guarantee that the raw data was processed in a way that makes it comparable to other experiments, or that the underlying raw data was of sufficiently high quality.

You are also going to need high quality metadata to derive any meaning from the data matrix. What were the biological conditions and sources from which the samples were derived? What genes do the probes on the particular array correspond to? (Note that 9890_at is a "probeset id", a unique identifier for a molecular probe of a particular sequence design, which then needs to be mapped to a gene. Different probes for the same gene won't give exactly the same response.)
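As an illustration of the probe-to-gene step, the following Python sketch collapses a probe-level matrix into a gene-level one by averaging all probes that map to the same gene. Averaging is only one simple choice among several (taking the median or the most variable probe are others), and it is not something prescribed by GEO or Bioconductor. The function name, probe ids, gene names and values are all hypothetical; in practice the mapping would come from the platform annotation.

```python
from collections import defaultdict
from statistics import mean


def collapse_probes_to_genes(probe_matrix, probe_to_gene):
    """Average the rows of all probes that map to the same gene.

    probe_matrix: {probe_id: [value per sample]}
    probe_to_gene: {probe_id: gene name}; unmapped probes are dropped.
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_matrix.items():
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # probe has no gene annotation in this mapping
        grouped[gene].append(values)

    # zip(*rows) walks the rows column by column, i.e. sample by sample.
    return {
        gene: [mean(column) for column in zip(*rows)]
        for gene, rows in grouped.items()
    }


# Hypothetical probe-level data for two samples.
probe_matrix = {
    "probe_a": [2.0, 4.0],
    "probe_b": [4.0, 8.0],
    "probe_c": [1.0, 1.0],
    "probe_d": [9.0, 9.0],  # not present in the mapping below
}
probe_to_gene = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}

gene_matrix = collapse_probes_to_genes(probe_matrix, probe_to_gene)
print(gene_matrix)
```

Because different probes for one gene respond differently, any such summary loses information, so it is worth keeping the probe-level matrix around as well.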

The public microarray databases therefore provide a lot of additional information alongside a processed data matrix. Besides GEO, which has already been mentioned, I would recommend ArrayExpress, which in my opinion has the better search interface.

For many, the tool of choice for working with microarray data is the Bioconductor suite of software for the statistical programming language R.

Bioconductor provides APIs to download raw data with accompanying metadata from both repositories; see the GEO bioc package and ArrayExpress bioc package.

Both packages, in common with most Bioconductor software, come with excellent "vignettes" that introduce the software: GEO bioc vignette and ArrayExpress bioc vignette.

Those vignettes should also give you examples of taking the raw data and deriving "Esets" (expression sets) from it. At that point you can access the gene expression matrix in the Bioconductor Eset object, and you have an object and APIs to interrogate the necessary metadata.

Note that there are different types of microarray. I would recommend starting with data from Affymetrix arrays, as they probably have the most straightforward analysis APIs.

Alex Stoddard