0

background

I've got a product_visits table that looks like this:

create table product_visits (product_id int, visitor_id int);

insert into product_visits values
  (1, 1),
  (1, 2),
  (1, 3),
  (1, 4),
  (1, 5),
  (2, 1),
  (2, 2),
  (2, 3),
  (2, 4),
  (2, 5),
  (3, 1),
  (3, 2),
  (3, 3),
  (4, 1),
  (4, 2),
  (5, 1);

or

| product_id | visitor_id |
|------------|------------|
|          1 |          1 |
|          1 |          2 |
|          1 |          3 |
|          1 |          4 |
|          1 |          5 |
|          2 |          1 |
|          2 |          2 |
|          2 |          3 |
|          2 |          4 |
|          2 |          5 |
|          3 |          1 |
|          3 |          2 |
|          3 |          3 |
|          4 |          1 |
|          4 |          2 |
|          5 |          1 |

I'm currently selecting the top 2 other products that visitors of a given product have also visited, using the following SQL:

SELECT a.`product_id`, count(a.`product_id`) visits
FROM `product_visits` a
INNER JOIN `product_visits` b ON a.`visitor_id` = b.`visitor_id`
WHERE b.`product_id` = ?
  AND a.`product_id` != ?
GROUP BY a.`product_id`
ORDER BY visits DESC 
LIMIT 2

For example, if run for product_id = 1, I'd get the following results with the above data:

| product_id | visits |
|------------|--------|
|          2 |      5 |
|          3 |      3 |

This is working okay when trying to get results for one product at a time.

problem

What I need to do is rewrite the above query so that it will work for all products in the product_visits table with a single query. I still need the results to be limited to just the top n results (e.g. 2) per product. For example, with the above data, the results that I'd like to see are as follows:

| target_product_id | related_product_id | visits |
|-------------------|--------------------|--------|
|                 1 |                  2 |      5 |
|                 1 |                  3 |      3 |
|                 2 |                  1 |      5 |
|                 2 |                  3 |      3 |
|                 3 |                  1 |      3 |
|                 3 |                  2 |      3 |
|                 4 |                  1 |      2 |
|                 4 |                  2 |      2 |
|                 5 |                  1 |      1 |
|                 5 |                  2 |      1 |

My closest attempt meeting the above goal is with the following code:

SELECT a.`product_id` target_product_id, b.`product_id` related_product_id, count(*) visits
FROM `product_visits` a
INNER JOIN `product_visits` b ON a.`visitor_id` = b.`visitor_id`
WHERE b.`product_id` != a.`product_id`
GROUP BY a.`product_id`, b.`product_id`
ORDER BY target_product_id ASC, visits DESC

which gives me the following results, but is still missing limiting the results to the top n matches per target_product_id:

| target_product_id | related_product_id | visits |
|-------------------|--------------------|--------|
|                 1 |                  2 |       5|
|                 1 |                  3 |       3|
|                 1 |                  4 |       2|
|                 1 |                  5 |       1|
|                 2 |                  1 |       5|
|                 2 |                  3 |       3|
|                 2 |                  4 |       2|
|                 2 |                  5 |       1|
|                 3 |                  1 |       3|
|                 3 |                  2 |       3|
|                 3 |                  4 |       2|
|                 3 |                  5 |       1|
|                 4 |                  3 |       2|
|                 4 |                  1 |       2|
|                 4 |                  2 |       2|
|                 4 |                  5 |       1|
|                 5 |                  3 |       1|
|                 5 |                  1 |       1|
|                 5 |                  4 |       1|
|                 5 |                  2 |       1|

I've been pounding my head on this for a while now but haven't been able to come up with the complete solution.

update #1

I ran Gordon Linoff's suggested SQL below against my production data - in a development db of course. I've got ~2.6 million records in my product_visits table. Leaving the limit set to 2, the query took 41.8572 seconds to run. Almost all of that time (40.4 s) was spent Copying To Tmp Table.

The output from running that SQL though EXPLAIN is as follows:

id | select_type | table      | type   | possible_keys    | key         | key_len | ref                   | rows    | Extra                                        |
 1 | PRIMARY     | <derived2> | ALL    | NULL             | NULL        | NULL    | NULL                  | 1161898 | Using where; Using filesort                  |
 2 | DERIVED     | <derived4> | system | NULL             | NULL        | NULL    | NULL                  |       1 |                                              |
 2 | DERIVED     | <derived3> | ALL    | NULL             | NULL        | NULL    | NULL                  | 1161898 |                                              |
 4 | DERIVED     | NULL       | NULL   | NULL             | NULL        | NULL    | NULL                  |    NULL | No tables used                               |
 3 | DERIVED     | a          | index  | PRIMARY,ndx_user | ndx_product | 24      | NULL                  | 2603025 | Using index; Using temporary; Using filesort | 
 3 | DERIVED     | b          | ref    | PRIMARY,ndx_user | PRIMARY     | 116     | product_visits.a.user |       1 | Using where; Using index                     |

While that SQL does pretty much exactly what I was looking for, the performance is killing me. Any ideas on speeding this up?

Community
  • 1
  • 1
Brandon
  • 105
  • 2
  • 9
  • Are you okay with using subqueries in your query? – hofan41 Mar 17 '16 at 21:34
  • Possible duplicate of [Get top n records for each group of grouped results](http://stackoverflow.com/questions/12113699/get-top-n-records-for-each-group-of-grouped-results) – Shadow Mar 17 '16 at 22:36

1 Answers1

0

I think the simplest method in MySQL is to use variables:

SELECT tr.*
FROM (SELECT tr.*,
             (@rn := if(@p = target_product_id, @rn + 1,
                        if(@p := target_product_id, 1, 1)
                       )
             ) as rn
      FROM (SELECT a.`product_id` as target_product_id, b.`product_id` as related_product_id, 
                   count(*) visits
            FROM `product_visits` a INNER JOIN
                 `product_visits` b
                 ON a.`visitor_id` = b.`visitor_id` AND
                    b.`product_id` != a.`product_id`
            GROUP BY a.`product_id`, b.`product_id`
            ORDER BY a.`product_id`, COUNT(*) desc
           ) tr CROSS JOIN
           (SELECT @p := -1, @rn := 0) params
      ) tr
WHERE rn <= 2
ORDER BY target_product_id ASC, visits DESC;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Thanks Gordon. Your SQL seems to be doing exactly what I was looking for. – Brandon Mar 17 '16 at 22:35
  • Gordon, looks like I've got some performance problems running the above SQL against my production data. I've added an update to the original question describing what I'm seeing. – Brandon Mar 17 '16 at 23:17