I have a large table in PostgreSQL (>2000 M rows) that has to be queried as quickly as possible. It represents measurements of gene expression in biological samples. The thing is that sometimes the measurement is directly on the gene ("probe" is then NULL) and sometimes the measurement is done via "probes" for a gene ("gene" is then still set). One gene can have multiple probes. There is no other table that contains the gene-probe relationship.
CREATE TABLE "gene_measurements" (
"gene" INTEGER NOT NULL REFERENCES "genes" ON DELETE CASCADE,
"sample" INTEGER NOT NULL REFERENCES "samples" ON DELETE CASCADE,
"probe" INTEGER REFERENCES "probes" ON DELETE CASCADE,
"value" REAL NOT NULL
);
Common queries include getting the expression for all genes in a given sample, getting the expression for a given gene/probe in all samples, or getting the expression of a given gene/probe in a given sample.
As it is now I have the following covering indexes. It works fine, but it's very space consuming.
CREATE INDEX "gene_measurements_gene_sample_value_index" ON "gene_measurements" ("gene", "sample", "value");
CREATE INDEX "gene_measurements_sample_gene_value_index" ON "gene_measurements" ("sample", "gene", "value");
CREATE INDEX "gene_measurements_sample_probe_value_index" ON "gene_measurements" ("sample", "probe", "value");
CREATE INDEX "gene_measurements_probe_sample_value_index" ON "gene_measurements" ("probe", "sample", "value");
Is there something clever I can do to get a neater and/or smaller implementation while maintaining speed? Thanks!