
I have a table in the following format:

[Screenshot of the table: one row per event attribute, with columns event_id, user_id, key, value]

I feel like this should be simple, but I'm struggling to come up with a performant query that can perform aggregations based on other rows sharing the same event_id. For example, I want to sum the rows for a user with the key MediaLength, but only if the rows with the key Score that share the event_id have a value greater than or equal to 3.

The result from a simple sum:

SELECT SUM(value::float) FROM data WHERE key = 'MediaLength' AND user_id = '9765f312-0d0b-4db0-b4c5-217eec81d7c3'

Result: 40

The result I am trying to achieve here is 15. In the table above you can see the rows are children of an event. I only want to sum the value column where key = 'MediaLength' and its sister row with key = 'Score' has value >= 3.
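In text form, a minimal sketch of the table might look like this (the schema and sample rows are assumptions, chosen only to be consistent with the sums of 40 and 15 described here; value is assumed to be text, which would explain the ::float casts):

CREATE TABLE data (
    event_id int,
    user_id  uuid,
    key      text,
    value    text
);

INSERT INTO data (event_id, user_id, key, value) VALUES
    (1, '9765f312-0d0b-4db0-b4c5-217eec81d7c3', 'MediaLength', '10'),
    (1, '9765f312-0d0b-4db0-b4c5-217eec81d7c3', 'Score',       '4'),
    (2, '9765f312-0d0b-4db0-b4c5-217eec81d7c3', 'MediaLength', '5'),
    (2, '9765f312-0d0b-4db0-b4c5-217eec81d7c3', 'Score',       '3'),
    (3, '9765f312-0d0b-4db0-b4c5-217eec81d7c3', 'MediaLength', '25'),
    (3, '9765f312-0d0b-4db0-b4c5-217eec81d7c3', 'Score',       '2');

With rows like these, events 1 and 2 have a Score of at least 3, so only their MediaLength rows count: 10 + 5 = 15.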

This is the query I have tried so far, but it seems a bit messy and also doesn't work, failing with a "more than one row returned by subquery" error:

select
    sum(value::float)
        filter (where (
            select d.value::float
            from data d
            where d.event_id = event_id  -- both sides resolve to the inner "d"; the outer "data" has no alias
            and d.key = 'Score'
        ) >= 3)
from data
where user_id = '9765f312-0d0b-4db0-b4c5-217eec81d7c3'

This is a simple example but in the future I would need to filter on potentially multiple other keys as well, so any advice on how to extend that is also hugely appreciated.

  • I think you will need to scope your `d.event_id = event_id` to use the alias of the outer table explicitly (`d.event_id = outer.event_id`), otherwise the plain `event_id` will refer to the inner `data d` relation as well. – Bergi Sep 19 '21 at 20:51
  • Instead of the screenshot (discouraged for data!) please provide a table definition and some sample data as *text*. Ideally, add a fiddle. Random example: https://dbfiddle.uk/?rdbms=postgres_13&fiddle=7fca2607d572b457ac89d2d5ef8041b2 – Erwin Brandstetter Sep 19 '21 at 23:58
  • I took the liberty to sharpen the description. Please correct me if I didn't get it right. – Erwin Brandstetter Sep 20 '21 at 00:13
  • Thanks Erwin! Yes, the clarification was perfect :) – Ollie Sep 20 '21 at 12:06

1 Answer


I only want to sum the value column where key = 'MediaLength' and its sister row with key = 'Score' has value >= 3.

SELECT sum(value::float)  -- why the cast?
FROM   data d
WHERE  user_id = '9765f312-0d0b-4db0-b4c5-217eec81d7c3'
AND    key = 'MediaLength'
AND    EXISTS (
   SELECT FROM data ds
   WHERE  ds.event_id = d.event_id
   AND    ds.user_id = d.user_id    -- !
   AND    ds.key = 'Score'
   AND    ds.value >= 3            -- if value is stored as text, this needs a cast: ds.value::float >= 3
   );

Here, rows with key = 'MediaLength' qualify if any sister passes the filter. (There may be more sisters failing the test.)

If there can only ever be a single qualifying sister row (enforced by a unique constraint / index?), a self-join is a bit simpler:

SELECT sum(value::float)
FROM   data d
JOIN   data ds USING (event_id, user_id)
WHERE  d.user_id = '9765f312-0d0b-4db0-b4c5-217eec81d7c3'
AND    d.key = 'MediaLength'
AND    ds.key = 'Score'
AND    ds.value >= 3;

The self-join would produce multiple result rows for multiple qualifying sister rows.
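If a single sister row per event really is guaranteed, a unique constraint along these lines (an assumption about the schema, not something stated in the question) would enforce it and make the self-join safe:

ALTER TABLE data ADD CONSTRAINT data_event_user_key_uni UNIQUE (event_id, user_id, key);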

At its core, this can be cast as a problem of relational division. Especially, since ...

in the future I would need to filter on potentially multiple other keys as well

See:
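As a concrete sketch of that extension (assuming the same table and a second, hypothetical sister key 'Difficulty'), each additional condition becomes one more EXISTS clause, which keeps the sum free of duplicate rows:

SELECT sum(value::float)
FROM   data d
WHERE  user_id = '9765f312-0d0b-4db0-b4c5-217eec81d7c3'
AND    key = 'MediaLength'
AND    EXISTS (                      -- condition on sister key 'Score'
   SELECT FROM data ds
   WHERE  ds.event_id = d.event_id
   AND    ds.user_id = d.user_id
   AND    ds.key = 'Score'
   AND    ds.value::float >= 3
   )
AND    EXISTS (                      -- second condition, one EXISTS per required key
   SELECT FROM data ds
   WHERE  ds.event_id = d.event_id
   AND    ds.user_id = d.user_id
   AND    ds.key = 'Difficulty'      -- assumed key, not in the original data
   AND    ds.value::float <= 5
   );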

  • Hi Erwin, thanks for such a detailed response! Would you be able to expand a little on the self-join and why that wouldn't work for filtering on multiple sister rows? – Ollie Sep 20 '21 at 12:05
  • @Ollie: The join in the 2nd query would produce multiple result rows for multiple qualifying sister rows, while the 1st query does not multiply rows like this. – Erwin Brandstetter Sep 21 '21 at 03:23
  • Ah yes, after some testing I see now, thanks again! If I could be so bold as to ask one more question: do you happen to know performance implications of the first query? Long story short we'll have a lot of aggregates happening with a dataset that is currently over 10M rows. – Ollie Sep 21 '21 at 07:13
  • @Ollie: For the query at hand I expect the 1st query to be the fastest possible option. Proper indexing to support it is essential for big tables. – Erwin Brandstetter Sep 24 '21 at 00:01
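
For reference, "proper indexing" here would plausibly mean a multicolumn index covering the correlation and filter columns; the exact shape below is an assumption, not part of the answer:

-- supports both the outer scan (user_id, key) and the EXISTS probe (event_id, user_id, key);
-- all three predicates are equality checks, so one index can serve both
CREATE INDEX data_user_key_event_idx ON data (user_id, key, event_id);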