
I have a large table in PostgreSQL (>2000 M rows) that has to be queried as quickly as possible. It represents measurements of gene expression in biological samples. The thing is that sometimes the measurement is directly on the gene ("probe" is then NULL) and sometimes the measurement is done via "probes" for a gene ("gene" is then still set). One gene can have multiple probes. There is no other table that contains the gene-probe relationship.

CREATE TABLE "gene_measurements" (
  "gene" INTEGER NOT NULL REFERENCES "genes" ON DELETE CASCADE,
  "sample" INTEGER NOT NULL REFERENCES "samples" ON DELETE CASCADE,
  "probe" INTEGER REFERENCES "probes" ON DELETE CASCADE,
  "value" REAL NOT NULL
);

Common queries include getting the expression for all genes in a given sample, getting the expression for a given gene/probe in all samples, or getting the expression of a given gene/probe in a given sample.
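For concreteness, these are the shapes of the queries involved (the literal IDs are made up for illustration):

```sql
-- Expression of all genes in a given sample:
SELECT gene, probe, value FROM gene_measurements WHERE sample = 42;

-- Expression of a given gene (or probe) across all samples:
SELECT sample, value FROM gene_measurements WHERE gene = 1234;
SELECT sample, value FROM gene_measurements WHERE probe = 5678;

-- Expression of a given gene in a given sample:
SELECT probe, value FROM gene_measurements WHERE gene = 1234 AND sample = 42;
```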

As it is now, I have the following covering indexes. It works fine, but it's very space-consuming.

CREATE INDEX "gene_measurements_gene_sample_value_index" ON "gene_measurements" ("gene", "sample", "value");
CREATE INDEX "gene_measurements_sample_gene_value_index" ON "gene_measurements" ("sample", "gene", "value");
CREATE INDEX "gene_measurements_sample_probe_value_index" ON "gene_measurements" ("sample", "probe", "value");
CREATE INDEX "gene_measurements_probe_sample_value_index" ON "gene_measurements" ("probe", "sample", "value");

Is there something clever I can do to get a neater and/or smaller implementation while maintaining speed? Thanks!

wildplasser
Rasmus

2 Answers


You can choose an arbitrary threshold between space and time. Right now, you have indexed the whole table four times. This obviously consumes a lot of space.

You could get rid of some data in the indexes in exchange for longer runtimes:

  • For example, you could remove value from all indexes. The indexes are then no longer covering, so each match additionally requires a fetch from the table itself.
  • You could also remove some indexes altogether. For example, depending on your data, you might remove either (sample, gene) or (sample, probe). This removes one complete coverage of the data, while still letting queries with conditions on sample and the removed column use the sample prefix of the remaining index. Again, the case you removed is then not as fast as before.
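As a sketch, a trimmed index set along these lines might look as follows (identifiers from the question; the partial-index WHERE clause is an extra assumption that only helps if probe lookups never search for NULL):

```sql
-- Non-covering: value is fetched from the table on each hit.
CREATE INDEX ON gene_measurements (gene, sample);
CREATE INDEX ON gene_measurements (sample, gene);
-- Partial index: skips the rows measured directly on the gene.
CREATE INDEX ON gene_measurements (probe, sample) WHERE probe IS NOT NULL;
```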

If your goal is to achieve minimum runtime at all costs, then none of these proposals are for you. I don't think there is anything in the PostgreSQL universe right now to solve your issue.

Since your data is simple and your use cases are restricted, you could consider solutions other than PostgreSQL. Essentially, all you want is a B-tree data structure (or several of them). There are other tools for building such a structure, e.g., QDBM. Still, you would need to build one of these structures per select type to optimize each of them. I would not consider the achievable space savings very high: you could get rid of the table data itself but of none of the indexes, so you would save roughly 1/5 of your current storage size, at the cost of restricted functionality and additional complexity in your software ecosystem.

You have to decide what you need, what you want and what you want to sacrifice for these goals. Considering what I have written down here, I would stick to PostgreSQL.

Ulrich Thomas Gabor
  • Thanks! Then I think I stick with the current implementation. Does `WHERE PROBE IS NOT NULL` save me space/time for the indexes involving probe? – Rasmus May 09 '20 at 08:50
  • Oh yes, since `"probe"` can be NULL, it may be enough to use these two indexes: `("probe", "sample", "gene", "value")` and `("probe", "gene", "sample", "value")`. – Ulrich Thomas Gabor May 09 '20 at 08:58
  • No, it's common that I query just by gene and ignore whether it's measured using a probe or not. I guess I can't have an index like `("probe", "gene", "sample", "value")` and for some queries specify that probe can be anything? I don't really know how indexes are implemented :) – Rasmus May 09 '20 at 09:05
  • But, it basically boils down to just optimize the indexes based on your queries. You can use the [documentation as a starting point](https://www.postgresql.org/docs/12/indexes-examine.html). – Ulrich Thomas Gabor May 09 '20 at 09:06
  • No, I don't think this is possible with B-Tree indexes – Ulrich Thomas Gabor May 09 '20 at 09:11

An SQL table really needs a primary key. Theoretically, a table without a key is meaningless. (In practice, a table with billions of rows lacking a PK is a disaster.)

In your case, the natural key appears to be the combination of the (gene, sample, probe) columns: values for all three are needed to uniquely address a value.

The problem is your "if probe is absent, the measurement was directly on the gene" anti-constraint. This forbids a three-column key; removing the exception allows a multi-column primary key. Now, the data trick is to insert one dummy row into the probes table, for instance with id=0.

INSERT INTO probe(probe_id, probe_when, probe_name)
 VALUES( 0, '1901-01-01 00:00:00', 'Dummy probe');

And now UPDATE gene_measurements, changing probe IS NULL to probe=0.
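The backfill itself can be a single statement (assuming the dummy probe row above has been inserted first):

```sql
-- Replace the NULL markers with the dummy probe id.
UPDATE gene_measurements SET probe = 0 WHERE probe IS NULL;
```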


CREATE TABLE gene_measurements (
  gene INTEGER NOT NULL REFERENCES genes(gene_id) ON DELETE CASCADE
  , sample INTEGER NOT NULL REFERENCES samples(sample_id) ON DELETE CASCADE
  , probe INTEGER NOT NULL REFERENCES probes(probe_id)
  , value REAL NOT NULL
  , PRIMARY KEY (gene, sample, probe)
);

Maybe add some other indexes, too, with different column orderings, to help specific queries, e.g.:

CREATE UNIQUE INDEX ON gene_measurements (sample, gene, probe);

And you'll need a supporting index for the probe FK; any index with probe as its first column will do:

CREATE INDEX ON gene_measurements (probe, ...);
wildplasser
  • Thanks! I've reconsidered and will follow your suggestions. – Rasmus May 11 '20 at 17:45
  • It does involve some work on the existing data. As a side-effect you may catch some duplicates, generating even more work. Crucial point is the (lack of) cardinality of (some of) your three *dimensions*, but I don't know your data. – wildplasser May 11 '20 at 18:44
  • I don't think I understand the part about cardinality. There are ~50k samples, 500k genes, and 500k probes. ~20% of the rows would have the dummy probe. Is this asymmetry an issue? – Rasmus May 11 '20 at 18:59