4

In a PostgreSQL 9.5.3 DB, I have a credit_card_balances table referencing a persons table which tracks the balances of various credit cards associated to a particular person:

CREATE TABLE persons (
  id serial PRIMARY KEY,
  name text
);

CREATE credit_card_balances (
  id serial PRIMARY KEY,
  card_provider text, 
  person int REFERENCES persons,
  balance decimal, 
  timestamp timestamp
);

Example row for credit_card_balances:

id  |  card_provider | person  | balance | timestamp
123 |  visa          | 1234    | 1.00    | 16-07-26 17:00

I need to retrieve the collection of people who have both a 'visa' and an 'amex' card, such that the most recent balance on the 'visa' card is larger than the most recent balance on the 'amex' card.

For each (person, card_provider) there may be up to around 100 rows in the table. Ideally, the output columns would be:

person, provider1_balance, provider2_balance, provider1_timestamp, provider2_timestamp

I know that I can do something like

SELECT DISTINCT ON (card_provider) *
FROM credit_card_balances 
WHERE person=1234
ORDER BY card_provider, timestamp DESC;

to get the most recent balances for each card for a particular person. But I'm not sure how to do that over all people and verify the conditions above, or if this is even the correct approach.

Edit: AS partially suggested in an answer, I can also do something like

SELECT * from credit_card_balances b1, credit_card_balances b2
WHERE b1.person = b2.person
AND (b1.card_provider = 'amex' 
     AND b1.timestamp in
        (SELECT MAX(time_stamp) 
         FROM credit_card_balances 
         WHERE card_provider = 'amex'))

AND (b2.card_provider = 'visa'
     AND <... same as above>)
AND b1.balance > b2.balance;

But I noticed that this leads to horrible performance. So I think this isn't such a good option.

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
ira
  • 735
  • 1
  • 7
  • 12
  • 1
    The usual suspects are missing: table definition (showing data types and constraints) and Postgres version. The best answer depends on these, as well as on cardinalities and value frequencies. Roughly how many rows per `(person, card_provider)` on avg.? Desired columns in the result would be swell, too. First things last: is there a table with distinct people in the same database? Typically, you have something like a `person` table for that. – Erwin Brandstetter Jul 26 '16 at 22:11
  • @ErwinBrandstetter Thank you for the helpful comments - I've added all that information to the post now! – ira Jul 26 '16 at 22:34
  • Much better! `timestamp` isn't defined `NOT NULL`? – Erwin Brandstetter Jul 26 '16 at 22:38

3 Answers3

3

This problem is a combination of two classics: and .

Given your updated specifications and with up to around 100 rows per (person, card_provider), I would expect this query to be substantially faster than what we have so far:

SELECT a.person
     , a.balance   AS amex_balance
     , v.balance   AS visa_balance
     , a.timestamp AS amex_timestamp
     , v.timestamp AS visa_timestamp
FROM   persons p
CROSS  JOIN LATERAL (
   SELECT balance, timestamp
   FROM   credit_card_balances 
   WHERE  person = p.id
   AND    card_provider = 'amex'  -- more selective credit card first to optimize
   ORDER  BY timestamp DESC
   LIMIT  1
   ) a
JOIN   LATERAL (
   SELECT balance, timestamp
   FROM   credit_card_balances 
   WHERE  person = p.id
   AND    card_provider = 'visa'  -- 2nd cc
   ORDER  BY timestamp DESC
   LIMIT  1
   ) v ON v.balance > a.balance;

Index support is crucial. This would be ideal for the case:

CREATE INDEX ON credit_card_balances (person, card_provider, timestamp DESC, balance);

Adding balance as last index column only makes sense if you get index-only scans out of it.

This is assuming that timestamp is defined NOT NULL, else you may need to add need NULLS LAST to query and index.

Related:


For only few rows per (person, card_provider) an approach with DISTINCT ON might be faster. A separate persons table wouldn't help. The sweet spot depends on many factors.

Assuming at least several different credit cards.

DISTINCT ON for one credit card, a LATERAL subquery for the other:

SELECT a.person
     , a.balance   AS amex_balance
     , v.balance   AS visa_balance
     , a.timestamp AS amex_timestamp
     , v.timestamp AS visa_timestamp
FROM  (
   SELECT DISTINCT ON (person)
          person, balance, timestamp
   FROM   credit_card_balances 
   WHERE  card_provider = 'amex'  -- the more selective credit card first
   ORDER  BY person, timestamp DESC
   ) a
JOIN  LATERAL (
   SELECT balance, timestamp
   FROM   credit_card_balances 
   WHERE  card_provider = 'visa'
   AND    person = a.person
   ORDER  BY timestamp DESC
   LIMIT  1
   ) v ON v.balance > a.balance

DISTINCT ON for each credit card, then join:

SELECT a.person
     , a.balance   AS amex_balance
     , v.balance   AS visa_balance
     , a.timestamp AS amex_timestamp
     , v.timestamp AS visa_timestamp
FROM  (
   SELECT DISTINCT ON (person)
          person, balance, timestamp
   FROM   credit_card_balances 
   WHERE  card_provider = 'amex'
   ORDER  BY person, timestamp DESC
   ) a
JOIN  (
   SELECT DISTINCT ON (person)
          person, balance, timestamp
   FROM   credit_card_balances 
   WHERE  card_provider = 'visa'
   ORDER  BY person, timestamp DESC
   ) v USING (person)
WHERE  v.balance > a.balance;

Or, my favorite: one DISTINCT ON for both credit cards, then filtered aggregate with a HAVING condition:

SELECT person
     , max(balance)   FILTER (WHERE card_provider = 'amex') AS amex_balance
     , max(balance)   FILTER (WHERE card_provider = 'visa') AS visa_balance
     , max(timestamp) FILTER (WHERE card_provider = 'amex') AS amex_timestamp
     , max(timestamp) FILTER (WHERE card_provider = 'visa') AS visa_timestamp
FROM  (
   SELECT DISTINCT ON (person, card_provider)
          person, card_provider, balance, timestamp
   FROM   credit_card_balances 
   WHERE  card_provider IN ('amex', 'visa')
   ORDER  BY person, card_provider, timestamp DESC
   ) c
GROUP  BY person
HAVING max(balance) FILTER (WHERE card_provider = 'visa')
     > max(balance) FILTER (WHERE card_provider = 'amex');

The aggregate FILTER clause requires Postgres 9.4+:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    Thank you for the fantastic and informative answer, as well as your suggestions on making my question better! Your first solution is perfect for my needs. For future people reading this, he's not messing around when he says that index support is _crucial_ – ira Jul 27 '16 at 11:49
0

Use a self join. Something like:

SELECT * from credit_card_balances b1, credit_card_balances b2
WHERE b1.person = b2.person
  AND b1.card_provider = 'amex'
  AND b2.card_provider = 'visa'
  AND b1.balance > b2.balance;

Combining this with more or less what you already came up with results in the following, using a view to keep the query easier to understand.

CREATE VIEW most_recent_balance AS
  SELECT DISTINCT ON (person, card_provider) *
    FROM credit_card_balances 
   GROUP BY id, person
   ORDER BY person, card_provider, timestamp DESC;

Substitute this most_recent_balance view in place of the table in the self join query.

alzee
  • 1,393
  • 1
  • 10
  • 21
  • Thanks for the response - I would say that the aspect I'm stuck on is combining this part with the other part. I'd figured this part out and the other individually, but putting them together has been hairy. – ira Jul 26 '16 at 19:24
  • You can do this with a view but I'm not sure if the performance is going to be any better than your updated query. I'll update my answer with the view in a moment. – alzee Jul 26 '16 at 19:45
  • Your `DISTINCT ON` and `GROUP BY` clauses contradict each other. Removing `GROUP BY` would make it float. – Erwin Brandstetter Jul 27 '16 at 13:59
0

You can do it with helf of nested select and window function

select * from (
     select *, 
       rank() over(partition by card_provider order by balance desc) as rank 
     from credit_card_balances
) credit_card_balances_ranked
where rank = 1
Taleh Ibrahimli
  • 750
  • 4
  • 13
  • 29