I'm working in a biology lab and I need to design a database to store the results of many DNA microarray experiments.
Each experiment is composed of many microarrays (about ten on average), and each microarray contains over 5 million probes. Each probe is mapped to a particular gene id; of course, the same probe maps to the same gene_id in all experiments. The goal is to store the intensity values of each microarray so that I can quickly retrieve the intensity values of the probes of a particular gene id in a particular experiment.
In fact, a simple MySQL table would be enough. It would look like this:
intensity table: |probe_id|experiment_id|microarray_id|gene_id|intensity_value|
with a primary key composed of (probe_id, experiment_id, microarray_id, gene_id).
Here's the problem: each experiment has many microarrays, each with over 5 million probes. With 1000 experiments and 10 microarrays on average (which is a low estimate, some have hundreds), that's 1000 * 10 * 5M = 50 billion rows. I guess it would be slow, and I have absolutely no idea how to handle a MySQL table with billions of rows. Is that even possible? Any tips?
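Concretely, the MySQL schema I have in mind would be something like the following (the column names are the ones above; the types, the engine choice, and the secondary index are just my guesses, and the ids in the query are placeholder values):

```sql
-- Hypothetical DDL for the intensity table described above.
CREATE TABLE intensity (
    probe_id        INT UNSIGNED NOT NULL,
    experiment_id   INT UNSIGNED NOT NULL,
    microarray_id   INT UNSIGNED NOT NULL,
    gene_id         INT UNSIGNED NOT NULL,
    intensity_value FLOAT NOT NULL,
    PRIMARY KEY (probe_id, experiment_id, microarray_id, gene_id),
    -- The access pattern is "all probes of a gene in an experiment",
    -- so I suppose a secondary index on that pair would be needed:
    KEY idx_gene_exp (gene_id, experiment_id)
) ENGINE=InnoDB;

-- The query that has to be fast (42 and 12345 are made-up ids):
SELECT probe_id, microarray_id, intensity_value
FROM intensity
WHERE gene_id = 12345
  AND experiment_id = 42;
```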
I'm also curious about NoSQL databases. I've never used Cassandra, but it seems to me that it would be perfect for this task. Am I right? I can imagine a schema like this:
{
    experiment_id_1: {    <- that's a super column family?
        gene_id_1: {      <- that's a column family?
            probe_id_1: { value_microarray_1, value_microarray_2, ... },    <- that's a super column?
            probe_id_2: { value_microarray_1, value_microarray_2, ... },
            probe_id_3: { value_microarray_1, value_microarray_2, ... },
            ...
        },
        gene_id_2: {
            probe_id_1: { value_microarray_1, value_microarray_2, ... },
            probe_id_2: { value_microarray_1, value_microarray_2, ... },
            probe_id_3: { value_microarray_1, value_microarray_2, ... },
            ...
        }
    },
    experiment_id_2: {
        ...
    }
    ...
}
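For what it's worth, if I try to translate the layout above into a CQL table (I'm only guessing at the syntax and the types here, since I've never actually used Cassandra), I picture one partition per (experiment_id, gene_id) pair:

```sql
-- Hypothetical CQL sketch of the layout above; names and types are my guesses.
CREATE TABLE intensities (
    experiment_id  int,
    gene_id        int,
    probe_id       bigint,
    microarray_id  int,
    intensity      float,
    PRIMARY KEY ((experiment_id, gene_id), probe_id, microarray_id)
);

-- One partition would hold all probe values for a gene in an experiment
-- (42 and 12345 are made-up ids):
SELECT probe_id, microarray_id, intensity
FROM intensities
WHERE experiment_id = 42 AND gene_id = 12345;
```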
Am I right? Would it fit the Cassandra model? Would it be efficient? What do you think, NoSQL gurus? :)
Thanks.