I have a log table purchase_history that keeps track of customer purchase history, and I would like to get the most recent purchase info for each product for a given customer_id, ordered by date_purchased.
The table has tens of millions of records, and the solution I have is very slow (20+ seconds) for customer_ids that account for most of the records in the table (25% of the records for a certain customer_id, for example). For other customer_ids with only a few rows, it is very fast (1 second).
Table definition:
create table purchase_history (
    id             int PRIMARY KEY,
    product_name   VARCHAR(100),
    date_purchased date,
    customer_id    int
);
Some dummy data:
INSERT INTO purchase_history VALUES
    (1, 'A', '2017-10-10', 123),
    (2, 'A', '2017-10-11', 123),
    (3, 'B', '2017-10-12', 123),
    (4, 'C', '2017-10-09', 123),
    (5, 'B', '2017-11-10', 123);
I have a multi-column index on (customer_id, product_name, date_purchased).
Results I intended to get:
5,B,2017-11-10
2,A,2017-10-11
4,C,2017-10-09
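(Product B appears twice in the dummy data; row 5 wins because 2017-11-10 is the later date. Likewise row 2 beats row 1 for product A.)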
The solution I came up with so far:
SELECT *
FROM (
    SELECT DISTINCT ON (product_name) *
    FROM   purchase_history
    WHERE  customer_id = 123
    ORDER  BY product_name, date_purchased DESC
) t
ORDER BY date_purchased DESC;
I wonder if there's a better or faster solution?
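For reference, an equivalent window-function formulation (just a sketch of an alternative; it usually performs about the same, since it also has to visit every row for the customer):

SELECT id, product_name, date_purchased
FROM (
    SELECT ph.*,
           row_number() OVER (PARTITION BY product_name
                              ORDER BY date_purchased DESC) AS rn
    FROM   purchase_history ph
    WHERE  customer_id = 123
) t
WHERE rn = 1                       -- keep only the latest purchase per product
ORDER BY date_purchased DESC;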
Update 01/14/2018
Thanks for the comments and answers so far, and sorry for the confusion. I would like to add a few more details:
- All columns are not null, including date_purchased.
- The index I have matches the ordering (date_purchased DESC):
create index purchase_history_idx on purchase_history (customer_id, product_name, date_purchased DESC);
- It is a good point to use a product_id that refers to another table, but unfortunately product_name doesn't exist in any other table. It is a name specified by the customer. Let's say I have a UI for customers to enter what they want to purchase, and what the customers enter is exactly product_name. So purchase_history keeps track of all "wishlists" for all customers.
Number of records:
- There's a total of 20M records in the table.
- customer_id=123 is our biggest customer, with 8,573,491 records, or 42%.
- customer_id=124 is our 2nd biggest customer, with 3,062,464 records, or 15%.
Here is the explain analyze for my original distinct on solution:
Sort (cost=2081285.86..2081607.09 rows=128492 width=106) (actual time=11771.444..12012.732 rows=623680 loops=1)
Sort Key: purchase_history.date_purchased
Sort Method: external merge Disk: 69448kB
-> Unique (cost=0.56..2061628.55 rows=128492 width=106) (actual time=0.021..11043.910 rows=623680 loops=1)
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..2040413.77 rows=8485910 width=106) (actual time=0.019..8506.109 rows=8573491 loops=1)
Index Cond: (customer_id = 123)
Planning time: 0.098 ms
Execution time: 12133.664 ms
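One detail I notice in this plan: the final sort spills to disk (external merge Disk: 69448kB). A session-level work_mem bump might keep it in memory; the value below is only a guess for illustration, not a tested setting:

SET work_mem = '200MB';  -- hypothetical value, chosen to comfortably exceed the ~70MB spill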
Here is the explain analyze for the CTE solution from Erwin:
Sort (cost=125.62..125.87 rows=101 width=532) (actual time=30924.208..31154.908 rows=623680 loops=1)
Sort Key: cte.date_purchased
Sort Method: external merge Disk: 33880kB
CTE cte
-> Recursive Union (cost=0.56..120.23 rows=101 width=39) (actual time=0.022..29772.944 rows=623680 loops=1)
-> Limit (cost=0.56..0.80 rows=1 width=39) (actual time=0.020..0.020 rows=1 loops=1)
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..2040413.77 rows=8485910 width=39) (actual time=0.019..0.019 rows=1 loops=1)
Index Cond: (customer_id = 123)
-> Nested Loop (cost=0.56..11.74 rows=10 width=39) (actual time=0.046..0.047 rows=1 loops=623680)
-> WorkTable Scan on cte c (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
-> Limit (cost=0.56..1.13 rows=1 width=39) (actual time=0.045..0.045 rows=1 loops=623680)
-> Index Scan using purchase_history_idx on purchase_history purchase_history_1 (cost=0.56..1616900.83 rows=2828637 width=39) (actual time=0.044..0.044 rows=1 loops=623680)
Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
-> CTE Scan on cte (cost=0.00..2.02 rows=101 width=532) (actual time=0.024..30269.107 rows=623680 loops=1)
Planning time: 0.207 ms
Execution time: 31273.462 ms
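The recursive-CTE cost is easy to account for: the inner index scan runs once per distinct product_name, and 623,680 loops × ~0.047 ms per loop ≈ 29.3 s, which lines up with the ~29.8 s actual time on the Recursive Union node.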
The other thing that surprises me is that my query runs much slower for customer_id=124, which contains far fewer records than customer_id=123. (Note: an Index Scan is not used; a Bitmap Index Scan is used instead, and I don't know why.)
Sort (cost=1323695.21..1323812.68 rows=46988 width=106) (actual time=85739.561..85778.735 rows=109347 loops=1)
Sort Key: purchase_history.date_purchased
Sort Method: external merge Disk: 14560kB
-> Unique (cost=1301329.65..1316845.56 rows=46988 width=106) (actual time=60443.890..85608.347 rows=109347 loops=1)
-> Sort (cost=1301329.65..1309087.61 rows=3103183 width=106) (actual time=60443.888..84727.062 rows=3062464 loops=1)
" Sort Key: purchase_history.product_name, purchase_history.date_purchased"
Sort Method: external merge Disk: 427240kB
-> Bitmap Heap Scan on purchase_history (cost=203634.23..606098.02 rows=3103183 width=106) (actual time=8340.662..10584.483 rows=3062464 loops=1)
Recheck Cond: (customer_id = 124)
Rows Removed by Index Recheck: 4603902
Heap Blocks: exact=41158 lossy=132301
-> Bitmap Index Scan on purchase_history_idx (cost=0.00..202858.43 rows=3103183 width=0) (actual time=8331.711..8331.711 rows=3062464 loops=1)
Index Cond: (customer_id = 124)
Planning time: 0.102 ms
Execution time: 85872.871 ms
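One clue I can read off this plan (my own interpretation, not verified): Heap Blocks: exact=41158 lossy=132301 means the bitmap exceeded work_mem and degraded to lossy, page-level entries, which is what forces the recheck that removes 4,603,902 rows. The same hypothetical work_mem experiment sketched above should apply here too:

SET work_mem = '256MB';  -- again a guessed value, per session only
-- then re-run EXPLAIN (ANALYZE, BUFFERS) for customer_id = 124 and compare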
Update 01/15/2018
Here is the explain (analyze, buffers) asked for by riskop:
GroupAggregate (cost=0.56..683302.46 rows=128492 width=31) (actual time=0.028..5156.113 rows=623680 loops=1)
Group Key: product_name
Buffers: shared hit=1242675
-> Index Only Scan using purchase_history_idx on purchase_history (cost=0.56..639587.99 rows=8485910 width=31) (actual time=0.022..2673.661 rows=8573491 loops=1)
Index Cond: (customer_id = 123)
Heap Fetches: 0
Buffers: shared hit=1242675
Planning time: 0.079 ms
Execution time: 5272.877 ms
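For context, the query behind this plan (my reconstruction from the GroupAggregate and index-only scan nodes; riskop's actual suggestion may differ slightly) would be:

SELECT product_name, max(date_purchased)
FROM   purchase_history
WHERE  customer_id = 123
GROUP  BY product_name;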
Note that I can't use this query, even though it is faster, for two reasons:
- The ordering is not specified in the query, whereas my expected result set is ordered by date_purchased DESC.
- There are a few more columns I need to include in the result set, so I can't just use group by.
One way to get around both issues is to use riskop's group by based query as a subquery or CTE, then add the order by and the additional columns as needed, as sketched below.
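A minimal sketch of that workaround (assuming the reconstructed group by query from above; note that if a product was purchased twice on its max date, this join returns both rows, so a DISTINCT ON may still be needed to dedupe):

WITH latest AS (
    SELECT product_name, max(date_purchased) AS date_purchased
    FROM   purchase_history
    WHERE  customer_id = 123
    GROUP  BY product_name
)
SELECT ph.*                      -- picks up id and any other needed columns
FROM   latest l
JOIN   purchase_history ph
       ON  ph.customer_id    = 123
       AND ph.product_name   = l.product_name
       AND ph.date_purchased = l.date_purchased
ORDER  BY ph.date_purchased DESC;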
Update 01/21/2018
Taking advantage of a "loose index scan" sounds like a good idea, but unfortunately the cardinality of product_name is very high. There are 1,810,440 unique product_name values and 2,565,179 unique (customer_id, product_name) combinations:
select count(distinct product_name) from purchase_history; -- 1810440
select count(distinct (customer_id, product_name)) from purchase_history; -- 2565179
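For customer_id=123 alone, the earlier plans show 623,680 distinct products over 8,573,491 rows, i.e. only about 14 rows per product on average. A loose index scan pays off when each index descent skips many rows, so skipping ~14 rows per descent buys almost nothing here.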
As a result, the query that took 313 ms for riskop took 33 seconds for me:
Sort (cost=122.42..122.68 rows=101 width=532) (actual time=33509.943..33748.856 rows=623680 loops=1)
Sort Key: cte.date_purchased
Sort Method: external merge Disk: 33880kB
" Buffers: shared hit=3053791 read=69706, temp read=4244 written=8484"
CTE cte
-> Recursive Union (cost=0.56..117.04 rows=101 width=39) (actual time=5.886..32288.212 rows=623680 loops=1)
Buffers: shared hit=3053788 read=69706
-> Limit (cost=0.56..0.77 rows=1 width=39) (actual time=5.885..5.885 rows=1 loops=1)
Buffers: shared hit=5 read=3
-> Index Scan using purchase_history_idx on purchase_history (cost=0.56..1809076.40 rows=8543899 width=39) (actual time=5.882..5.882 rows=1 loops=1)
Index Cond: (customer_id = 123)
Buffers: shared hit=5 read=3
-> Nested Loop (cost=0.56..11.42 rows=10 width=39) (actual time=0.050..0.051 rows=1 loops=623680)
Buffers: shared hit=3053783 read=69703
-> WorkTable Scan on cte c (cost=0.00..0.20 rows=10 width=516) (actual time=0.000..0.000 rows=1 loops=623680)
-> Limit (cost=0.56..1.10 rows=1 width=39) (actual time=0.049..0.049 rows=1 loops=623680)
Buffers: shared hit=3053783 read=69703
-> Index Scan using purchase_history_idx on purchase_history purchase_history_1 (cost=0.56..1537840.29 rows=2847966 width=39) (actual time=0.048..0.048 rows=1 loops=623680)
Index Cond: ((customer_id = 123) AND ((product_name)::text > (c.product_name)::text))
Buffers: shared hit=3053783 read=69703
-> CTE Scan on cte (cost=0.00..2.02 rows=101 width=532) (actual time=5.889..32826.816 rows=623680 loops=1)
" Buffers: shared hit=3053788 read=69706, temp written=4240"
Planning time: 0.278 ms
Execution time: 33873.798 ms
Notice it did an in-memory sort (Sort Method: quicksort Memory: 853kB) for riskop, but an external disk sort (Sort Method: external merge Disk: 33880kB) for me.
If this is not a solvable problem with a relational DB, I wonder if there is any non-relational DB or big-data-based solution that meets these 2 requirements:
- Reasonable response time (2 seconds, for example).
- Real-time, without delay.