My experience with Postgres is pretty much nil, my computer is from 2010. So not the fastest, but still going strong.
I looked at recommended answers, but I don't get them (yet) - and I'd like to understand what I am doing.
I have a Postgres database with columns names
and number_of_occurrences
(plus some others that have unique values for each name). The database has about 300 million rows (will increase to about 6 billion), and MOST names only occur once, while some can occur thousands or even millions of times.
So far the database only contains the names, and it is my job now to fill the number_of_occurrences
column.
What would be the most efficient way of doing that?
Normally, I would imagine I should first use DISTINCT
on names
and then for each distinct value run a count, and then save that into a new table, and finally discard the old table.
However, that isn't possible as there are also other unique values for each row in the table, so the count has to be put back into the original table.
I'm actually trying to help out my wife with a problem at work. She works on the X Chromosome which is about 140 Million "letters" long. The problem is to find PAIRS of primers (fragments of 20 letters) that fulfill certain properties (eg GC content, Tm, etc), are within a certain distance of each other (eg 100-150 letters), and occur a few times on the X Chromosome but do not occur on the Y Chromosome (and ideally also not on any other Chromosomes).
First step was to create the primers (all 140 Million - 20 for the X Chromosome as the primers go from position 1-20, 2-21, 3-22, etc) and calculate the relevant values that are required later on (eg GC content, Tm for anyone with a biological background).
Primary key of the table is id, an auto-generated UUID (will be the same for all other tables).
No indices have been created yet - as I understand it keeping the index updated while adding rows is expensive, and it is much faster to create the complete table and then index it.
Schema and other tables are irrelevant to the problem and just clutter the mind.
I thought number_of_occurrences
should be done AFTER the table has been created, but with the idea of making number_of_occurrences
a foreign key I can actually set or increase the value each time I add a row, so the problem disappears.