background
I've got a product_visits table that looks like this:
create table product_visits (product_id int, visitor_id int);
insert into product_visits values
(1, 1),
(1, 2),
(1, 3),
(1, 4),
(1, 5),
(2, 1),
(2, 2),
(2, 3),
(2, 4),
(2, 5),
(3, 1),
(3, 2),
(3, 3),
(4, 1),
(4, 2),
(5, 1);
or
| product_id | visitor_id |
|------------|------------|
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 1 | 4 |
| 1 | 5 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 2 | 4 |
| 2 | 5 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 4 | 1 |
| 4 | 2 |
| 5 | 1 |
I'm currently selecting the top 2 other products that visitors of a given product have also visited, using the following SQL:
SELECT a.`product_id`, count(a.`product_id`) visits
FROM `product_visits` a
INNER JOIN `product_visits` b ON a.`visitor_id` = b.`visitor_id`
WHERE b.`product_id` = ?
AND a.`product_id` != ?
GROUP BY a.`product_id`
ORDER BY visits DESC
LIMIT 2
For example, if run for product_id = 1, I'd get the following results with the above data:
| product_id | visits |
|------------|--------|
| 2 | 5 |
| 3 | 3 |
This is working okay when trying to get results for one product at a time.
problem
What I need to do is rewrite the above query so that it will work for all products in the product_visits
table with a single query. I still need the results to be limited to just the top n
results (e.g. 2) per product. For example, with the above data, the results that I'd like to see are as follows:
| target_product_id | related_product_id | visits |
|-------------------|--------------------|--------|
| 1 | 2 | 5 |
| 1 | 3 | 3 |
| 2 | 1 | 5 |
| 2 | 3 | 3 |
| 3 | 1 | 3 |
| 3 | 2 | 3 |
| 4 | 1 | 2 |
| 4 | 2 | 2 |
| 5 | 1 | 1 |
| 5 | 2 | 1 |
My closest attempt meeting the above goal is with the following code:
SELECT a.`product_id` target_product_id, b.`product_id` related_product_id, count(*) visits
FROM `product_visits` a
INNER JOIN `product_visits` b ON a.`visitor_id` = b.`visitor_id`
WHERE b.`product_id` != a.`product_id`
GROUP BY a.`product_id`, b.`product_id`
ORDER BY target_product_id ASC, visits DESC
which gives me the following results, but is still missing limiting the results to the top n
matches per target_product_id
:
| target_product_id | related_product_id | visits |
|-------------------|--------------------|--------|
| 1 | 2 | 5|
| 1 | 3 | 3|
| 1 | 4 | 2|
| 1 | 5 | 1|
| 2 | 1 | 5|
| 2 | 3 | 3|
| 2 | 4 | 2|
| 2 | 5 | 1|
| 3 | 1 | 3|
| 3 | 2 | 3|
| 3 | 4 | 2|
| 3 | 5 | 1|
| 4 | 3 | 2|
| 4 | 1 | 2|
| 4 | 2 | 2|
| 4 | 5 | 1|
| 5 | 3 | 1|
| 5 | 1 | 1|
| 5 | 4 | 1|
| 5 | 2 | 1|
I've been pounding my head on this for a while now but haven't been able to come up with the complete solution.
update #1
I ran Gordon Linoff's suggested SQL below against my production data - in a development db of course. I've got ~2.6 million records in my product_visits
table. Leaving the limit set to 2, the query took 41.8572 seconds to run. Almost all of that time (40.4 s) was spent Copying To Tmp Table.
The output from running that SQL though EXPLAIN
is as follows:
id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1161898 | Using where; Using filesort |
2 | DERIVED | <derived4> | system | NULL | NULL | NULL | NULL | 1 | |
2 | DERIVED | <derived3> | ALL | NULL | NULL | NULL | NULL | 1161898 | |
4 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
3 | DERIVED | a | index | PRIMARY,ndx_user | ndx_product | 24 | NULL | 2603025 | Using index; Using temporary; Using filesort |
3 | DERIVED | b | ref | PRIMARY,ndx_user | PRIMARY | 116 | product_visits.a.user | 1 | Using where; Using index |
While that SQL does pretty much exactly what I was looking for, the performance is killing me. Any ideas on speeding this up?