
My experience with Postgres is pretty much nil, and my computer is from 2010 - so not the fastest, but still going strong.

I looked at recommended answers, but I don't get them (yet) - and I'd like to understand what I am doing.

I have a Postgres database with the columns `names` and `number_of_occurrences` (plus some others that hold unique values for each name). The table has about 300 million rows (this will increase to about 6 billion), and MOST names occur only once, while some can occur thousands or even millions of times.

So far the database only contains the names, and it is my job now to fill the number_of_occurrences column.

What would be the most efficient way of doing that?

Normally, I would imagine I should first use DISTINCT on names, then run a count for each distinct value, save that into a new table, and finally discard the old table.

However, that isn't possible, as there are also other unique values for each row in the table, so the count has to be put back into the original table.
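In SQL terms, what I have in mind is roughly the following (a sketch only; `tbl` stands in for the real table name):

```sql
-- Count per distinct name; this produces one row per name,
-- so the result would still have to be joined back to the original rows.
SELECT names, count(*) AS number_of_occurrences
FROM   tbl
GROUP  BY names;
```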

I'm actually trying to help out my wife with a problem at work. She works on the X chromosome, which is about 140 million "letters" long. The problem is to find PAIRS of primers (fragments of 20 letters) that fulfill certain properties (e.g. GC content, Tm, etc.), are within a certain distance of each other (e.g. 100-150 letters), and occur a few times on the X chromosome but do not occur on the Y chromosome (and ideally also not on any other chromosome).

The first step was to create the primers (all 140 million − 20 of them for the X chromosome, as the primers run from positions 1-20, 2-21, 3-22, etc.) and calculate the relevant values that are required later on (e.g. GC content and Tm, for anyone with a biological background).
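For illustration, generating the overlapping 20-mers could look roughly like this (a sketch only; `chromosome(seq)` and `tbl(names)` are made-up names, and the real loader also computes GC content, Tm, etc.):

```sql
-- Sketch: emit every overlapping 20-letter window of the sequence.
-- One window starts at each position from 1 to length - 19.
INSERT INTO tbl (names)
SELECT substr(c.seq, g.pos, 20)
FROM   chromosome c
CROSS  JOIN LATERAL generate_series(1, length(c.seq) - 19) AS g(pos);
```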

The primary key of the table is `id`, an auto-generated UUID (it will be the same for all other tables).

No indices have been created yet - as I understand it, keeping an index updated while adding rows is expensive, and it is much faster to create the complete table first and then index it.

Schema and other tables are irrelevant to the problem and just clutter the mind.

I thought `number_of_occurrences` should be filled AFTER the table has been created, but with the idea of making `number_of_occurrences` a foreign key I can actually set or increase the value each time I add a row, so the problem disappears.
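One way to realize that "count while inserting" idea would be a separate counts table updated with an upsert (rather than a foreign key as such) - a sketch with made-up names:

```sql
-- Sketch: keep the counts in a separate table, keyed by the primer string,
-- and bump the count whenever a new row is added to the main table.
CREATE TABLE name_counts (
  names                 text PRIMARY KEY,
  number_of_occurrences bigint NOT NULL DEFAULT 1
);

INSERT INTO name_counts (names)
VALUES ('ACGTACGTACGTACGTACGT')   -- the primer just inserted into the main table
ON CONFLICT (names)
DO UPDATE SET number_of_occurrences = name_counts.number_of_occurrences + 1;
```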

  • Typical: as soon as you post a question you get an idea. I just thought I should probably change number_of_occurrences to a foreign key so updating the number in one row changes it for all rows … newbie error … BUT: that should mean that I just go over all rows ONCE and add 1 to each occurrence! I don't think you can get more efficient than that … – Markus Winter Apr 30 '23 at 00:01
  • "*there are also other unique values for each row in the table*" - what is the primary key of the table? What indices are defined on it? It would help if you could [edit] your question to include the whole schema (as a `CREATE TABLE` statement for each of the relevant involved tables) – Bergi Apr 30 '23 at 00:16
  • You're probably looking for a window function: `count(*) OVER (PARTITION BY name)` – Bergi Apr 30 '23 at 00:17
  • "*What would be the most efficient way of doing that?*" - doing what exactly? Adding the `number_of_occurrences` column and filling it once with values? Keeping it up to date when inserting new rows? – Bergi Apr 30 '23 at 00:18
  • 1
    For a 6 billion row dataset I sugget you get off your ancient computer and consider some cost effective cloud solutions such as Databricks in Azure or AWS. Otherwise you'll be waiting days or weeks for your answer. – Nick.Mc Apr 30 '23 at 01:56

1 Answer


Use `count(*)` as a plain aggregate function in a subquery to an `UPDATE`, and join back on `names`.

Assuming the new column already exists:

```sql
UPDATE tbl t
SET    number_of_occurrences = ct.ct
FROM  (
   SELECT names, count(*) AS ct
   FROM   tbl
   GROUP  BY names
   ) ct
WHERE  t.names = ct.names;
```

You don't need any indexes for this operation; a sequential scan will do. You are right to create indexes later. (Though, if an index on `(names)` already exists, it might be used.)

Actually, if you are free to do so, it's probably cheaper to just create a new table, because updating in Postgres means writing a new row version, which is expensive, especially when updating every row of a big table. See:

Assuming the new column does not exist yet:

```sql
CREATE TABLE tbl2 AS
SELECT *, count(*) OVER (PARTITION BY names) AS number_of_occurrences
FROM   tbl
ORDER  BY names;  -- optional, but possibly beneficial

DROP TABLE tbl;
```

This time `count(*)` is used as a window function, which does not aggregate rows but keeps each input row.

Of course, the whole operation hardly makes sense if you are going to add more rows later and have to recount and update every time. In that case I would not add this column at all. Have an index on `(names)`, and do the count on the fly, or in a view or materialized view.
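For example (a sketch, reusing the names from the question; `names_count_mv` is just an illustrative name):

```sql
CREATE INDEX ON tbl (names);

-- Precompute the counts once and refresh them after each bulk load:
CREATE MATERIALIZED VIEW names_count_mv AS
SELECT names, count(*) AS number_of_occurrences
FROM   tbl
GROUP  BY names;

REFRESH MATERIALIZED VIEW names_count_mv;
```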

If `names` is a long string, consider a smaller expression index on `(hashtextextended(names, 0))` - the function takes a seed as its second argument. See:
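A sketch of what that could look like:

```sql
-- 64-bit hash of the string instead of the (potentially long) string itself:
CREATE INDEX ON tbl (hashtextextended(names, 0));

-- A query has to repeat the same expression to use the index,
-- and recheck the actual string to rule out hash collisions:
SELECT count(*)
FROM   tbl
WHERE  hashtextextended(names, 0) = hashtextextended('ACGTACGTACGTACGTACGT', 0)
AND    names = 'ACGTACGTACGTACGTACGT';
```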

Aside: "names" is a dodgy name.

Erwin Brandstetter
  • Thanks for the answer Erwin. I don't know PARTITION BY yet, and it is probably the right solution if the table would just be split into a few hundred partitions, but a quick search seems to indicate that "the big table is split into smaller physical pieces that can be stored in different storage media based on its use". I don't think my computer will be particularly happy if I create 100 Million + new files from the X Chromosome alone (as most primers occur only once), plus physical access is slower than memory access (I have 96 GB RAM). – Markus Winter Apr 30 '23 at 00:59
  • Just an FYI: genome size: ≈ 2.87 GB, 4^20 = 1,099,511,627,776, => random 20mer has a 0.26102498 % chance of occurring (1 in 383) in the genome and a 0.01264 % chance of occurring on the X Chromosome – Markus Winter Apr 30 '23 at 01:02
  • @MarkusWinter: Sorry, use a plain aggregate with the `UPDATE`. My initial draft was wrong, a spillover from the alternative answer. – Erwin Brandstetter Apr 30 '23 at 01:05
  • Thanks ;-) "Aside: "names" is a dodgy name." - agreed, but I thought easier to grasp for non-biologists than primers or 20mers … and yes, I'm free to create the table as I'm still at the experimental stage where I use a small "test chromosome" of just 1035 letters to check whether I get what I expect (especially when it comes to splitting the task up for the sub processes and joining the results together). On the plus side I brought the time for filling the table for the X Chromosome down from about 34 hours to 140 min and my latest version should bring that to under 60 min. – Markus Winter Apr 30 '23 at 01:24
  • @MarkusWinter: It's hard to tell when people dumb it down for the purpose of the question, and when dumb names are actually going to be used. ;) Useful names go a long way. https://stackoverflow.com/a/20880247/939860 – Erwin Brandstetter Apr 30 '23 at 01:28
  • P.S. As I have to do both directions (DNA is a double helix) that 140 million "letters" translates into 280 Million rows. – Markus Winter Apr 30 '23 at 01:29
  • Be sure to read the chapter [*"Populating a Database"*](https://www.postgresql.org/docs/current/populate.html#POPULATE-COPY-FROM) and use the [latest version of Postgres](https://www.postgresql.org/support/versioning/). – Erwin Brandstetter Apr 30 '23 at 01:30
  • "Useful names go a long way" - preaching to the choir there ;-). I go out of my way to use descriptive names in my programming, and hate nothing more than function declarations with parameters like doSomething(i, d, x, y, s) … the time you invest in typing (which isn't much with autocomplete) you get back tenfold when debugging … – Markus Winter Apr 30 '23 at 01:31
  • Latest version is "check" - downloaded and installed it this week. Compared to SQlite it feels a bit like going from a simple drawing app to Photoshop … – Markus Winter Apr 30 '23 at 01:33
  • @MarkusWinter: That about captures the order of magnitude. ;) Good luck! – Erwin Brandstetter Apr 30 '23 at 01:35