
I have a log table purchase_history that keeps track of customer purchase history, and I would like to get the most recent purchase info for each product for a given customer_id, ordered by date_purchased.

The table has tens of millions of records, and the solution I have is very slow (20+ seconds) for customer_ids that account for a large share of the table (for example, one customer_id owns 25% of all records). For other customer_ids with only a few rows it is very fast (1 second).

Table definition:

create table purchase_history (
  id int PRIMARY KEY,
  product_name VARCHAR(100),
  date_purchased date,
  customer_id int
);

Some dummy data:

INSERT INTO purchase_history VALUES
  (1, 'A', '2017-10-10', 123)
, (2, 'A', '2017-10-11', 123)
, (3, 'B', '2017-10-12', 123)
, (4, 'C', '2017-10-09', 123)
, (5, 'B', '2017-11-10', 123);

I have a multi-column index on (customer_id, product_name, date_purchased).

Results I intended to get:

5,B,2017-11-10
2,A,2017-10-11
4,C,2017-10-09

The solution I came up with so far:

SELECT *
FROM (
       SELECT DISTINCT ON (product_name) *
       FROM purchase_history
       WHERE customer_id = 123
       ORDER BY product_name, date_purchased DESC
     ) t
ORDER BY date_purchased DESC;

I wonder if there's a better or faster solution?


Update 01/14/2018

Thanks for the comments and answers so far, and sorry for the confusion. I would like to add a few more details:

  1. All columns are NOT NULL, including date_purchased.
  2. The index I have matches the ordering (date_purchased DESC):

    create index purchase_history_idx on purchase_history(customer_id, product_name, date_purchased DESC)
    
  3. It is a good point to use a product_id that refers to another table, but unfortunately product_name doesn't exist in any other table. It is a name specified by the customer. Let's say I have a UI for customers to enter what they want to purchase, and what the customers enter is exactly product_name. So purchase_history keeps track of all "wishlists" for all customers.

Number of records:

  • There's a total of 20M records in the table
  • customer_id=123 is our biggest customer, with 8573491 records (42%)
  • customer_id=124 is our 2nd biggest customer, with 3062464 records (15%)

Here is the explain analyze for my original distinct on solution:

Sort  (cost=2081285.86..2081607.09 rows=128492 width=106) (actual time=11771.444..12012.732 rows=623680 loops=1)
  Sort Key: purchase_history.date_purchased
  Sort Method: external merge  Disk: 69448kB
  ->  Unique  (cost=0.56..2061628.55 rows=128492 width=106) (actual time=0.021..11043.910 rows=623680 loops=1)
        ->  Index Scan using purchase_history_idx on purchase_history  (cost=0.56..2040413.77 rows=8485910 width=106) (actual time=0.019..8506.109 rows=8573491 loops=1)
              Index Cond: (customer_id = 123)
Planning time: 0.098 ms
Execution time: 12133.664 ms

Here is the explain analyze for the recursive CTE solution from Erwin:

Sort  (cost=125.62..125.87 rows=101 width=532) (actual time=30924.208..31154.908 rows=623680 loops=1)
  Sort Key: cte.date_purchased
  Sort Method: external merge  Disk: 33880kB
  CTE cte
    ->  Recursive Union  (cost=0.56..120.23 rows=101 width=39) (actual time=0.022..29772.944 rows=623680 loops=1)
          ->  Limit  (cost=0.56..0.80 rows=1 width=39) (actual time=0.020..0.020 rows=1 loops=1)
                ->  Index Scan using purchase_history_idx on purchase_history  (cost=0.56..2040413.77 rows=8485910 width=39) (actual time=0.019..0.019 rows=1 loops=1)
                      Index Cond: (customer_id = 123)
          ->  Nested Loop  (cost=0.56..11.74 rows=10 width=39) (actual time=0.046..0.047 rows=1 loops=623680)
                ->  WorkTable Scan on cte c  (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
                ->  Limit  (cost=0.56..1.13 rows=1 width=39) (actual time=0.045..0.045 rows=1 loops=623680)
                      ->  Index Scan using purchase_history_idx on purchase_history purchase_history_1  (cost=0.56..1616900.83 rows=2828637 width=39) (actual time=0.044..0.044 rows=1 loops=623680)
                            Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
  ->  CTE Scan on cte  (cost=0.00..2.02 rows=101 width=532) (actual time=0.024..30269.107 rows=623680 loops=1)
Planning time: 0.207 ms
Execution time: 31273.462 ms

The other thing that surprises me is that my query runs much slower for customer_id=124, which contains far fewer records than customer_id=123. (Note: the Index Scan is not used; a Bitmap Index Scan is used instead, and I don't know why.)

Sort  (cost=1323695.21..1323812.68 rows=46988 width=106) (actual time=85739.561..85778.735 rows=109347 loops=1)
  Sort Key: purchase_history.date_purchased
  Sort Method: external merge  Disk: 14560kB
  ->  Unique  (cost=1301329.65..1316845.56 rows=46988 width=106) (actual time=60443.890..85608.347 rows=109347 loops=1)
        ->  Sort  (cost=1301329.65..1309087.61 rows=3103183 width=106) (actual time=60443.888..84727.062 rows=3062464 loops=1)
"              Sort Key: purchase_history.product_name, purchase_history.date_purchased"
              Sort Method: external merge  Disk: 427240kB
              ->  Bitmap Heap Scan on purchase_history  (cost=203634.23..606098.02 rows=3103183 width=106) (actual time=8340.662..10584.483 rows=3062464 loops=1)
                    Recheck Cond: (customer_id = 124)
                    Rows Removed by Index Recheck: 4603902
                    Heap Blocks: exact=41158 lossy=132301
                    ->  Bitmap Index Scan on purchase_history_idx  (cost=0.00..202858.43 rows=3103183 width=0) (actual time=8331.711..8331.711 rows=3062464 loops=1)
                          Index Cond: (customer_id = 124)
Planning time: 0.102 ms
Execution time: 85872.871 ms

Update 01/15/2018

Here is the explain (analyze,buffers) requested by riskop:

GroupAggregate  (cost=0.56..683302.46 rows=128492 width=31) (actual time=0.028..5156.113 rows=623680 loops=1)
  Group Key: product_name
  Buffers: shared hit=1242675
  ->  Index Only Scan using purchase_history_idx on purchase_history  (cost=0.56..639587.99 rows=8485910 width=31) (actual time=0.022..2673.661 rows=8573491 loops=1)
        Index Cond: (customer_id = 123)
        Heap Fetches: 0
        Buffers: shared hit=1242675
Planning time: 0.079 ms
Execution time: 5272.877 ms

Note that I can't use this query as-is, even though it is faster, for two reasons:

  1. The ordering is not specified in the query, whereas my expected result set is ordered by date_purchased DESC.
  2. There are a few more columns I need to include in the result set, so I can't just use GROUP BY.

One way to get around both issues is to use riskop's GROUP BY query as a subquery or CTE, then add the ORDER BY and the extra columns as needed.
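For example, something like this untested sketch (it assumes (product_name, date_purchased) is unique per customer; ties on the max date would produce duplicate rows):

SELECT ph.*
FROM  (
       SELECT product_name, max(date_purchased) AS date_purchased
       FROM   purchase_history
       WHERE  customer_id = 123
       GROUP  BY product_name
      ) t
JOIN   purchase_history ph USING (product_name, date_purchased)
WHERE  ph.customer_id = 123
ORDER  BY ph.date_purchased DESC;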


Update 01/21/2018

Taking advantage of a "loose indexscan" sounds like a good idea, but unfortunately product_name is highly distributed: there are 1810440 unique product_name values and 2565179 unique (customer_id, product_name) combinations:

select count(distinct product_name) from purchase_history; -- 1810440

select count(distinct (customer_id, product_name)) from purchase_history; -- 2565179

As a result, the query that took 313 ms for riskop takes 33 seconds for me:

Sort  (cost=122.42..122.68 rows=101 width=532) (actual time=33509.943..33748.856 rows=623680 loops=1)
  Sort Key: cte.date_purchased
  Sort Method: external merge  Disk: 33880kB
"  Buffers: shared hit=3053791 read=69706, temp read=4244 written=8484"
  CTE cte
    ->  Recursive Union  (cost=0.56..117.04 rows=101 width=39) (actual time=5.886..32288.212 rows=623680 loops=1)
          Buffers: shared hit=3053788 read=69706
          ->  Limit  (cost=0.56..0.77 rows=1 width=39) (actual time=5.885..5.885 rows=1 loops=1)
                Buffers: shared hit=5 read=3
                ->  Index Scan using purchase_history_idx on purchase_history  (cost=0.56..1809076.40 rows=8543899 width=39) (actual time=5.882..5.882 rows=1 loops=1)
                      Index Cond: (customer_id = 123)
                      Buffers: shared hit=5 read=3
          ->  Nested Loop  (cost=0.56..11.42 rows=10 width=39) (actual time=0.050..0.051 rows=1 loops=623680)
                Buffers: shared hit=3053783 read=69703
                ->  WorkTable Scan on cte c  (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
                ->  Limit  (cost=0.56..1.10 rows=1 width=39) (actual time=0.049..0.049 rows=1 loops=623680)
                      Buffers: shared hit=3053783 read=69703
                      ->  Index Scan using purchase_history_idx on purchase_history purchase_history_1  (cost=0.56..1537840.29 rows=2847966 width=39) (actual time=0.048..0.048 rows=1 loops=623680)
                            Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
                            Buffers: shared hit=3053783 read=69703
  ->  CTE Scan on cte  (cost=0.00..2.02 rows=101 width=532) (actual time=5.889..32826.816 rows=623680 loops=1)
"        Buffers: shared hit=3053788 read=69706, temp written=4240"
Planning time: 0.278 ms
Execution time: 33873.798 ms

Notice that riskop's run sorted in memory (Sort Method: quicksort Memory: 853kB), while mine spilled to disk (Sort Method: external merge Disk: 33880kB).
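One thing I might still try is raising work_mem for the session so the 33880kB sort fits in memory:

SET work_mem = '64MB';  -- session-level setting; then re-run the query

But the sort is only a fraction of the 33 seconds, so I don't expect that alone to solve the problem.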

If this is not a solvable problem with a relational DB, I wonder if there's any non-relational DB or big-data based solution, as long as it meets two requirements:

  1. Reasonable response time (2 seconds, for example).
  2. Realtime, without delay.
Liang Zhou
  • If there are many rows it sometimes resorts to a sequential scan. Can you post an explain analyze? I don't think a group by, ie. "select product_name, date_purchased from purchase_history where customer_id = 123 group by product_name, date_purchased" will help but worth trying. – kometen Jan 12 '18 at 09:34
  • {product_name, date_purchased} could be a natural key. (iff it is unique, which it is not) Same for {customer_id, date_purchased} So you end up with all three of them as the natural key. (iff date_purchased were unique enough ... -->> it should be a timestamp) – joop Jan 12 '18 at 11:32
  • So do you have your answer? – Erwin Brandstetter Jan 14 '18 at 02:13
  • You can create a "helper" table with columns (customer_id,product_id,last_purchase_date,id). In that table customer_id and product_id would be composite key. According to your update on 21st jan. there would be about 2.5M records in that table. That's much less than the original. You can also have an index on this table on columns (customer_id, last_purchase_date). I expect queries searching for customer_id + last_purchase_date will be very quick. The price for this is that you have to maintain the new table and it's index every time when a record is inserted into the 20M table. – riskop Feb 08 '18 at 08:56
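For illustration, riskop's helper-table idea from the last comment might look like the sketch below. All object names are assumptions, product_name is used instead of product_id since no product table exists, and ON CONFLICT requires Postgres 9.5+:

CREATE TABLE latest_purchase (
   customer_id    int,
   product_name   varchar(100),
   date_purchased date NOT NULL,
   id             int NOT NULL,  -- id of the most recent purchase_history row
   PRIMARY KEY (customer_id, product_name)
);

CREATE INDEX latest_purchase_date_idx
ON latest_purchase (customer_id, date_purchased DESC);

CREATE FUNCTION maintain_latest_purchase()
  RETURNS trigger LANGUAGE plpgsql AS
$func$
BEGIN
   -- keep only the newest purchase per (customer_id, product_name)
   INSERT INTO latest_purchase AS lp (customer_id, product_name, date_purchased, id)
   VALUES (NEW.customer_id, NEW.product_name, NEW.date_purchased, NEW.id)
   ON     CONFLICT (customer_id, product_name) DO UPDATE
   SET    date_purchased = EXCLUDED.date_purchased, id = EXCLUDED.id
   WHERE  EXCLUDED.date_purchased >= lp.date_purchased;
   RETURN NEW;
END
$func$;

CREATE TRIGGER trg_latest_purchase
AFTER INSERT ON purchase_history
FOR EACH ROW EXECUTE PROCEDURE maintain_latest_purchase();

The slow query would then become a simple index-range scan over a ~2.5M-row table:

SELECT id, product_name, date_purchased
FROM   latest_purchase
WHERE  customer_id = 123
ORDER  BY date_purchased DESC;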

4 Answers


Index

Postgres can scan indexes backwards very efficiently, but I would still define the index to match the intended sort order exactly:

(customer_id, product_name, date_purchased DESC)

This is a minor optimization. But since date_purchased can be NULL according to your table definition, you probably want ORDER BY product_name, date_purchased DESC NULLS LAST, which should be accompanied by a matching index - and that is a major optimization:

CREATE INDEX new_idx ON purchase_history
(customer_id, product_name, date_purchased DESC NULLS LAST);


Query

DISTINCT ON is very efficient for few rows per (customer_id, product_name), but less so for many rows, which is your weak spot.

This recursive CTE should be able to make perfect use of a matching index:

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, product_name, date_purchased
   FROM   purchase_history
   WHERE  customer_id = 123
   ORDER  BY product_name, date_purchased DESC NULLS LAST
   LIMIT  1
   )

   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, product_name, date_purchased
      FROM   purchase_history
      WHERE  customer_id = 123               -- repeat condition
      AND    product_name > c.product_name   -- lateral reference
      ORDER  BY product_name, date_purchased DESC NULLS LAST
      LIMIT  1
      ) u
   )
TABLE  cte
ORDER  BY date_purchased DESC NULLS LAST;



You might even fork the logic and run the rCTE for customers with many rows, while sticking to DISTINCT ON for customers with few rows ...
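A sketch of such a fork in a PL/pgSQL function (untested; the function name and the 10000-row cutoff are placeholders to adapt):

CREATE OR REPLACE FUNCTION recent_purchases(_customer_id int)
  RETURNS TABLE (id int, product_name varchar, date_purchased date)
  LANGUAGE plpgsql AS
$func$
DECLARE
   _cap bigint;
BEGIN
   -- capped count: stops after 10000 rows, so it stays cheap for big customers
   SELECT count(*) INTO _cap
   FROM  (SELECT 1
          FROM   purchase_history p
          WHERE  p.customer_id = _customer_id
          LIMIT  10000) sub;

   IF _cap < 10000 THEN  -- few rows: DISTINCT ON wins
      RETURN QUERY
      SELECT t.id, t.product_name, t.date_purchased
      FROM  (SELECT DISTINCT ON (p.product_name)
                    p.id, p.product_name, p.date_purchased
             FROM   purchase_history p
             WHERE  p.customer_id = _customer_id
             ORDER  BY p.product_name, p.date_purchased DESC NULLS LAST) t
      ORDER  BY t.date_purchased DESC NULLS LAST;
   ELSE                  -- many rows: emulate a loose index scan with the rCTE
      RETURN QUERY
      WITH RECURSIVE cte AS (
         (SELECT p.id, p.product_name, p.date_purchased
          FROM   purchase_history p
          WHERE  p.customer_id = _customer_id
          ORDER  BY p.product_name, p.date_purchased DESC NULLS LAST
          LIMIT  1)
         UNION ALL
         SELECT u.*
         FROM   cte c
         CROSS  JOIN LATERAL (
            SELECT p.id, p.product_name, p.date_purchased
            FROM   purchase_history p
            WHERE  p.customer_id = _customer_id
            AND    p.product_name > c.product_name
            ORDER  BY p.product_name, p.date_purchased DESC NULLS LAST
            LIMIT  1) u
         )
      SELECT c2.id, c2.product_name, c2.date_purchased
      FROM   cte c2
      ORDER  BY c2.date_purchased DESC NULLS LAST;
   END IF;
END
$func$;

Call with: SELECT * FROM recent_purchases(123);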

Schema

Notably, your table purchase_history has product_name varchar(100). In a perfect world (normalized schema), this would be product_id int instead, with a FK reference to a product table. This would help performance in multiple ways: smaller table and index, faster operations on integer instead of varchar(100).
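For illustration only, a normalized layout might look like this (table and column names are assumptions):

CREATE TABLE product (
   product_id   serial PRIMARY KEY,
   product_name varchar(100) UNIQUE NOT NULL
);

CREATE TABLE purchase_history (
   id             int PRIMARY KEY,
   product_id     int NOT NULL REFERENCES product,
   date_purchased date,
   customer_id    int NOT NULL
);

CREATE INDEX purchase_history_idx ON purchase_history
(customer_id, product_id, date_purchased DESC NULLS LAST);

With customer-entered names, you would upsert into product on data entry and store only the returned product_id.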


Erwin Brandstetter

Try expressing your GROUP BY explicitly:

SELECT *
FROM purchase_history ph
JOIN 
(
       SELECT product_name, MAX(date_purchased) max_date_purchased
       FROM purchase_history
       WHERE customer_id = 123
       GROUP BY product_name
) t ON ph.product_name = t.product_name and
       ph.date_purchased = t.max_date_purchased and
       ph.customer_id = 123
ORDER BY ph.date_purchased DESC;

Another solution is to use window functions:

SELECT *
FROM 
(
       SELECT *,
             dense_rank() over (partition by product_name order by date_purchased desc) rn
       FROM purchase_history
       WHERE customer_id = 123
) t 
WHERE t.rn = 1
ORDER BY t.date_purchased DESC;

Test it and you will see which one is more performant.

Radim Bača
  • While the queries look good, I would not expect either to be notably faster than the original. There is a corner case bug, though: `MAX(date_purchased)` is not equivalent to the original if NULL values can be involved (which is the case according to the table definition). – Erwin Brandstetter Jan 12 '18 at 12:55

I think the most important question is the distribution of product_name in your data.

You mentioned that users fill this with product names, so I guess that you have a few thousand distinct product_name values.

If that is the case, then I think your problem is that PostgreSQL is not using a "loose indexscan" (https://wiki.postgresql.org/wiki/Loose_indexscan), even though the number of distinct values is small compared to the overall number of records.

Good article describing a case very similar to yours: http://malisper.me/the-missing-postgres-scan-the-loose-index-scan/

So I've tried to reproduce your big dataset. The test data created by the procedure below consists of 20 million rows. There are 10000 kinds of products (product_name is a random integer between 0 and 9999). There are 44 distinct customer_id values: 43% of the rows belong to "123", 15% to "124", and the remaining 42% are distributed randomly among ids 58 through 99. date_purchased is a random day between 1092-04-05 and 1913-08-19.

do $$
begin
-- start from scratch
drop table if exists purchase_history;
create table purchase_history (
  id int,
  product_name VARCHAR(100) not null,
  date_purchased date not null,
  customer_id int not null
);
-- 20M rows: random product_name in 0..9999, random Julian-day date, customer_id in 0..99
FOR i IN 0..20000000 - 1 LOOP
  insert into purchase_history values (
    i,
    trunc(random() * 10000),
    to_date((trunc(random() * 300000 + 2120000))::int::text, 'J'),
    trunc(random() * 100)
  );
end loop;
-- skew the distribution: ids 0..42 become customer 123 (43%), ids 43..57 become 124 (15%)
update purchase_history set customer_id=123 where customer_id < 43;
update purchase_history set customer_id=124 where customer_id < 58;
ALTER TABLE purchase_history ADD PRIMARY KEY (id);
end;
$$;

The index is the same as in your post:

CREATE INDEX idx ON purchase_history
(customer_id, product_name, date_purchased desc);

Just to make sure that we indeed have 10000 distinct product_name values:

SELECT product_name FROM purchase_history GROUP BY product_name;

Now the "reference" query runs in 3200 millisec on this dataset:

explain (analyze,buffers)
SELECT product_name, max(date_purchased)
FROM purchase_history 
WHERE customer_id = 123
GROUP BY product_name
order by max(date_purchased) desc;

Execution:

Sort  (cost=171598.50..171599.00 rows=200 width=222) (actual time=3219.176..3219.737 rows=10000 loops=1)
Sort Key: (max(date_purchased)) DESC
Sort Method: quicksort  Memory: 853kB
Buffers: shared hit=3 read=105201 written=11891
->  HashAggregate  (cost=171588.86..171590.86 rows=200 width=222) (actual time=3216.382..3217.361 rows=10000 loops=1)
      Group Key: product_name
      Buffers: shared hit=3 read=105201 written=11891
      ->  Bitmap Heap Scan on purchase_history  (cost=2319.56..171088.86 rows=100000 width=222) (actual time=766.196..1634.934 rows=8599329 loops=1)
            Recheck Cond: (customer_id = 123)
            Rows Removed by Index Recheck: 15263
            Heap Blocks: exact=45627 lossy=26625
            Buffers: shared hit=3 read=105201 written=11891
            ->  Bitmap Index Scan on idx  (cost=0.00..2294.56 rows=100000 width=0) (actual time=759.686..759.686 rows=8599329 loops=1)
                  Index Cond: (customer_id = 123)
                  Buffers: shared hit=3 read=32949 written=11859
Planning time: 0.192 ms
Execution time: 3220.096 ms

The optimized query -- basically the same as Erwin's -- uses the index to perform a "loose indexscan" with the help of a recursive CTE (which, despite the name, works iteratively). It runs in only 310 ms, about 10 times faster:

explain (analyze,buffers)
WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, product_name, date_purchased
   FROM   purchase_history
   WHERE  customer_id = 123
   ORDER  BY product_name, date_purchased DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, product_name, date_purchased
      FROM   purchase_history
      WHERE  customer_id = 123               -- repeat condition
      AND    product_name > c.product_name   -- lateral reference
      ORDER  BY product_name, date_purchased DESC
      LIMIT  1
      ) u
   )
TABLE  cte
ORDER  BY date_purchased DESC NULLS LAST;

Execution:

Sort  (cost=444.02..444.27 rows=101 width=226) (actual time=312.928..313.585 rows=10000 loops=1)
Sort Key: cte.date_purchased DESC NULLS LAST
Sort Method: quicksort  Memory: 853kB
Buffers: shared hit=31432 read=18617 written=14
CTE cte
  ->  Recursive Union  (cost=0.56..438.64 rows=101 width=226) (actual time=0.054..308.678 rows=10000 loops=1)
        Buffers: shared hit=31432 read=18617 written=14
        ->  Limit  (cost=0.56..3.79 rows=1 width=226) (actual time=0.052..0.053 rows=1 loops=1)
              Buffers: shared hit=4 read=1
              ->  Index Scan using idx on purchase_history  (cost=0.56..322826.56 rows=100000 width=226) (actual time=0.050..0.050 rows=1 loops=1)
                    Index Cond: (customer_id = 123)
                    Buffers: shared hit=4 read=1
        ->  Nested Loop  (cost=0.56..43.28 rows=10 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
              Buffers: shared hit=31428 read=18616 written=14
              ->  WorkTable Scan on cte c  (cost=0.00..0.20 rows=10 width=218) (actual time=0.000..0.000 rows=1 loops=10000)
              ->  Limit  (cost=0.56..4.29 rows=1 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
                    Buffers: shared hit=31428 read=18616 written=14
                    ->  Index Scan using idx on purchase_history purchase_history_1  (cost=0.56..124191.22 rows=33333 width=226) (actual time=0.030..0.030 rows=1 loops=10000)
                          Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
                          Buffers: shared hit=31428 read=18616 written=14
->  CTE Scan on cte  (cost=0.00..2.02 rows=101 width=226) (actual time=0.058..310.821 rows=10000 loops=1)
      Buffers: shared hit=31432 read=18617 written=14
Planning time: 0.418 ms
Execution time: 313.988 ms
riskop

Could you tell us the result of the following simplified query in your environment?

explain (analyze,buffers)
SELECT product_name, max(date_purchased) 
FROM purchase_history 
WHERE customer_id = 123
GROUP BY product_name;
riskop