
Let's say I have the following table playgrounds:

 serialnumber  length  breadth  country
 1             15      10       Brazil
 2             12      11       Chile
 3             14      10       Brazil
 4             14      10       Brazil

Now, I want to add a column area to the table, that is essentially length*breadth.
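
Assuming the column does not exist yet, it first has to be added (a minimal sketch, assuming integer length and breadth):

ALTER TABLE playgrounds ADD COLUMN area integer;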

Obviously, I can do this update:

UPDATE playgrounds SET area = length * breadth WHERE country = 'Brazil';

Using the above statement, length * breadth is unnecessarily computed twice, once each for serial numbers 3 and 4, even though the values are identical. Is there a way to add a GROUP BY and minimize the number of calculations?

Something like:

UPDATE playgrounds SET area = length * breadth WHERE country = 'Brazil'
GROUP BY length, breadth;
Sandy
  • There are some ways to evaluate an expression only once, but the cost of the calculation has to be pretty high for these methods to pay off. The simple method (repeated evaluation) needs no random I/O - just a sequential scan. The other methods need to select the distinct values, do the calculations, and then join the calculated values back. Your issue comes from wrong normalisation - in a well-designed database you don't need to solve problems like this. – Pavel Stehule Jul 27 '16 at 19:43
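
For illustration, a minimal sketch of the join-back approach Pavel describes: compute each distinct length/breadth pair once in a subquery, then join the results back (this assumes the area column already exists):

UPDATE playgrounds p
SET    area = d.area
FROM  (
   SELECT DISTINCT length, breadth, length * breadth AS area
   FROM   playgrounds
   WHERE  country = 'Brazil'
   ) d
WHERE  p.country = 'Brazil'
AND    p.length  = d.length
AND    p.breadth = d.breadth;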

2 Answers


The first thing to note is that you should not add the area as a column. Data items that happen to be the result of simple arithmetic operations do not need their own column.
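
If you do decide you want the value materialized anyway, note that PostgreSQL 12+ can maintain such a column automatically as a generated column (a sketch, assuming integer columns):

ALTER TABLE playgrounds
ADD COLUMN area integer GENERATED ALWAYS AS (length * breadth) STORED;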

The second point is that you don't need to worry about doing a multiplication operation once each for rows 3 and 4. That's almost zero effort for the server.

The third point is that if you are worried about rows 3 and 4, that means they are duplicated, and duplicated data should not be in the database. Consider deleting the duplicates as described here: https://wiki.postgresql.org/wiki/Deleting_duplicates
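
One common pattern from that wiki page, adapted to this table (a sketch that keeps the row with the smallest serialnumber in each duplicate set):

DELETE FROM playgrounds a
USING  playgrounds b
WHERE  a.serialnumber > b.serialnumber
AND    a.length  = b.length
AND    a.breadth = b.breadth
AND    a.country = b.country;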

e4c5
  • I understand that the data is duplicated. In my scenario there are more than 40 columns in my table, and the value that I am trying to compute uses more than 5 columns of the same table. The columns are not used directly; they go through some internal calculation/conditioning first. For example, I check whether a particular string in one of those 4 columns follows a particular regular expression pattern, convert it into an integer, and then multiply it with 4 more columns that go through more conditions to compute the result. Contd. – Sandy Jul 28 '16 at 20:49
  • I know I could separate the 4 columns and the result into their own table, but I feel doing that would be over-normalization in my case. – Sandy Jul 28 '16 at 20:52

To answer your question:

Is there a way to add a GROUP BY and minimize the number of calculations?

SELECT DISTINCT ON (1,2,3)
       length, breadth, country, length * breadth AS area
FROM   playgrounds
ORDER  BY 1, 2, 3, serialnumber;

This takes the row with the smallest serialnumber from each set of duplicates.
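
With the sample data above, this returns one row per distinct combination; for (14, 10, Brazil) the row with serialnumber 3 wins:

 length  breadth  country  area
 12      11       Chile    132
 14      10       Brazil   140
 15      10       Brazil   150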

But consider @e4c5's answer and Pavel's comment first. Don't store functionally dependent values that can be computed on the fly cheaply. Just drop the duplicate rows and use a view:

To permanently delete dupes with greater serialnumber:

DELETE FROM playgrounds p
WHERE  EXISTS (
   SELECT 1
   FROM   playgrounds
   WHERE  length  = p.length
   AND    breadth = p.breadth
   AND    country = p.country
   AND    serialnumber < p.serialnumber
   );

Then:

CREATE VIEW playgrounds_plus AS
SELECT *, length * breadth AS area
FROM   playgrounds;
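
The view then exposes area as if it were a stored column, always up to date:

SELECT serialnumber, country, area
FROM   playgrounds_plus
WHERE  country = 'Brazil';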

Erwin Brandstetter