
I am seeing unexplained performance issues with MySQL queries.

The data is a MySQL InnoDB table with 3.85 million rows of item-to-item correlation data.

Each row says that for item "item_i", the item "also_i" was also ordered by "count_i" people.

CREATE TABLE `hl_also2sm` (
  `item_i` int(10) unsigned NOT NULL DEFAULT '0',
  `also_i` int(10) unsigned NOT NULL DEFAULT '0',
  `count_i` int(10) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`item_i`,`also_i`),
  KEY `count_i` (`count_i`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

A sample correlation is performed by taking a list of items, finding the correlated items, and returning the approximate time the MySQL query took to run.

// JavaScript in Node.js with MySQL, on Debian Linux
var util = require('util');
var sql = require('./routes/sqlpool'); // connects to the DB
var cmd = util.promisify(sql.cmd); // promisified raw MySQL command function

async function inquiry(NumberOfItems){
  // generate random list of items to perform correlation against
  var rtn = await cmd(`select DISTINCT item_i from hl_also2sm order by RAND() limit ${NumberOfItems}`);
  var items_l = rtn.map((h)=>{return h.item_i});

  var ts = Date.now();

  // get top 50 correlated items
  var c = `select also_i,COUNT(*) as cnt,SUM(count_i) as sum from hl_also2sm 
    where item_i IN (${items_l.join(",")}) 
    AND also_i NOT IN (${items_l.join(",")}) 
    group by also_i 
    order by cnt DESC,sum DESC limit 50`;
  await cmd(c);

  var MilliSeconds = Date.now()-ts;
  return MilliSeconds;
};

To test this over a range of list sizes:

async function inquiries(){
  for (var items = 200; items < 3000; items += 200) {
    var Data = [];
    for (var i = 0; i < 10; i++) {
      Data.push(await inquiry(items));
    }
    Data.sort((a,b) => a - b); // numeric sort; the default sort() compares values as strings
    console.log(`${items} items - min:${Data[0]} max:${Data[9]}`);
  }
}

The results were:

200 items - min:315 max:331
400 items - min:1214 max:1235
600 items - min:2669 max:2718
800 items - min:4796 max:4823
1000 items - min:6872 max:7006
1200 items - min:134 max:154
1400 items - min:147 max:169
1600 items - min:162 max:198
1800 items - min:190 max:212
2000 items - min:210 max:244
2200 items - min:237 max:258
2400 items - min:248 max:293
2600 items - min:263 max:302
2800 items - min:292 max:322

which is very puzzling.

Why is the 2000-item query over 25X faster than the 1000-item query??

The EXPLAIN for the 1000-item select is:

| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                        |
|  1 | SIMPLE      | hl_also2sm | index | PRIMARY       | count_i | 4       | NULL | 4043135 | Using where; Using index; Using temporary; Using filesort |

The EXPLAIN for the 2000-item select is:

| id | select_type | table      | type  | possible_keys | key     | key_len | ref  | rows   | Extra                                        |
|  1 | SIMPLE      | hl_also2sm | range | PRIMARY       | PRIMARY | 4       | NULL | 758326 | Using where; Using temporary; Using filesort |

I ran this many times, each producing similar results.

Yes, many of my users really have shown interest in thousands of items, through pageviews, comments, picture views, and orders. I would like to produce a good "you might also like" list for them.

Summary of the problem

select  also_i,
        COUNT(*) as cnt,
        SUM(count_i) as sum
    from  hl_also2sm
    where  item_i     IN (...)   -- Varying the number of IN items
      AND  also_i NOT IN (...)   -- Varying the number of IN items
    group by  also_i
    order by  cnt DESC, sum DESC
    limit  50

For <= 1K items in the IN lists, the query scans the count_i index and runs slower.
For > 1K items in the IN lists, the query does a range scan on the PRIMARY key and runs faster.
Why??
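
One way to check that the plan choice alone explains the gap is to pin each index in turn and time the identical query. A sketch (not from the original post) reusing the cmd() helper and an items_l list as above, with MySQL's standard FORCE INDEX hint:

// Time the same query under each candidate index.
async function timePlans(items_l){
  var in_list = items_l.join(",");
  for (var idx of ["PRIMARY", "count_i"]) {
    var ts = Date.now();
    await cmd(`select also_i,COUNT(*) as cnt,SUM(count_i) as sum
      from hl_also2sm FORCE INDEX (${idx})
      where item_i IN (${in_list})
      AND also_i NOT IN (${in_list})
      group by also_i
      order by cnt DESC,sum DESC limit 50`);
    console.log(`${idx}: ${Date.now()-ts}ms`);
  }
}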

J. McNerney
  • **WARNING**: Do not use string interpolation to compose queries as this leads to nasty [SQL injection bugs](http://bobby-tables.com/). Instead use placeholder values to represent the data. Most drivers support these natively, and wrappers like [Sequelize](http://docs.sequelizejs.com) can make it very easy to use them. – tadman Jul 11 '19 at 17:01
  • @tadman thanks. This is not production code. Any string interpolation inputs would be sanitized in production. This will become a backend function, with no user input, so I don't see where injection could come from anyway. From what I have read, having to use FIND_IN_SET for my "IN ()" syntax, if I used placeholders, is not an option since it does not use indexes and would be very slow. – J. McNerney Jul 11 '19 at 17:32
  • Doesn't matter. **Use placeholder values**. Your "sanitization" code could be total garbage for all I know. These sorts of bugs are really obnoxious and time-consuming to fix, but doing it correctly the first time is effortless. There's zero excuses for not doing it properly and avoiding all the headaches that come with injection issues. – tadman Jul 11 '19 at 17:38
  • Placeholders also have zero impact on indexing or a lack of indexing, they're just used to introduce data into your query. Since those items come from a `SELECT` what you really want here is a `JOIN` to avoid the round-trip through the application layer. Do this all as one query. – tadman Jul 11 '19 at 17:39
  • @tadman Thank you. I'm all for "doing it the right way". Could you provide a re-write of the "var c" query using placeholders? In production, this line will be called via an API, which passes in a list of items to check against. This is why I didn't use a JOIN, so don't worry about trying to JOIN the "var rtn" line into the query. – J. McNerney Jul 11 '19 at 18:07
  • https://stackoverflow.com/questions/11957643/mysql-variable-format-for-a-not-in-list-of-values – J. McNerney Jul 11 '19 at 18:08
  • That's why I linked to Sequelize documentation. That's a great database layer for Node, supports multiple databases, is promise driven, and has fantastic placeholder support. Normally all you need to do is replace your `...${x}` stuff with `...?` and then supply additional data parameters. – tadman Jul 11 '19 at 18:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/196337/discussion-between-j-mcnerney-and-tadman). – J. McNerney Jul 11 '19 at 18:58
  • The `EXPLAIN` will probably give a clue of the 25X. Show them to us. – Rick James Jul 12 '19 at 05:46
  • What language is that test program written in? Or where is `cmd()` defined? – Rick James Jul 12 '19 at 05:48
  • `DESCRIBE` is not as descriptive as `SHOW CREATE TABLE`; please provide the latter. – Rick James Jul 12 '19 at 06:02
  • @RickJames Thanks!!! The question has been updated. The slow select uses the count_i key, while the fast one uses PRIMARY. SHOW CREATE TABLE is a great feature, thanks for the information. – J. McNerney Jul 12 '19 at 15:37
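
Following up on the placeholder discussion in the comments above, a minimal sketch of the "var c" query with placeholders, assuming the underlying driver is the node mysql package (which expands an array bound to a single ? into a comma-separated list) and that sql.cmd forwards a values array through to the driver's query():

// Assumes cmd(sqlString, values) reaches node-mysql's query(); each ?
// is bound to an array, which the driver expands into a quoted list.
var c = `select also_i,COUNT(*) as cnt,SUM(count_i) as sum from hl_also2sm
  where item_i IN (?)
  AND also_i NOT IN (?)
  group by also_i
  order by cnt DESC,sum DESC limit 50`;
await cmd(c, [items_l, items_l]);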

1 Answer


Changing the

PRIMARY KEY (`item_i`,`also_i`)

to

KEY (`item_i`)
KEY (`also_i`)

seemed to solve the issue.

CREATE TABLE `hl_also2sm` (
  `item_i` int(10) unsigned NOT NULL DEFAULT '0',
  `also_i` int(10) unsigned NOT NULL DEFAULT '0',
  `count_i` int(10) unsigned NOT NULL DEFAULT '0',
  KEY `count_i` (`count_i`),
  KEY `item_i` (`item_i`),
  KEY `also_i` (`also_i`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
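
For reference, the same change can be made in place. A sketch using the cmd() helper from the question (my assumption: the ALTER, which rebuilds the table, can run during the daily update window):

// Drop the composite PK and add the two single-column keys.
// ALTER TABLE rebuilds the table, so schedule it with the daily update.
await cmd(`ALTER TABLE hl_also2sm
  DROP PRIMARY KEY,
  ADD KEY \`item_i\` (\`item_i\`),
  ADD KEY \`also_i\` (\`also_i\`)`);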

The new schema gives performance of

200 min:113 max:85
400 min:148 max:193
600 min:225 max:268
800 min:292 max:362
1000 min:333 max:450
1200 min:360 max:536
1400 min:521 max:618
1600 min:607 max:727
1800 min:698 max:789
2000 min:767 max:841
2200 min:765 max:952
2400 min:1000 max:987
2600 min:1011 max:1241
2800 min:1118 max:1186

which looks reasonable, though I would like it to be faster. Suggestions on restructuring this for better performance would be appreciated.

  • Trying USE INDEX(PRIMARY) to force the key choice was slower.
  • Dropping the index on count_i was slower.

Changing to ENGINE=MEMORY, since this is a read-only table small enough to fit in memory (a 200MB in-memory image on a 16GB machine), yielded:

200 min:16 max:23
400 min:28 max:38
600 min:46 max:56
800 min:58 max:69
1000 min:71 max:89
1200 min:100 max:99
1400 min:105 max:99
1600 min:116 max:132
1800 min:126 max:153
2000 min:139 max:165
2200 min:158 max:181
2400 min:171 max:194
2600 min:197 max:208
2800 min:203 max:223

This seems very reasonable for my purposes.
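
For reference, a sketch of the conversion (assumptions: the session max_heap_table_size must exceed the ~200MB table image, and MEMORY tables lose their rows on a server restart, so the daily rebuild must repopulate them):

// Raise the per-session heap-table cap, then convert the engine.
await cmd(`SET SESSION max_heap_table_size = ${256 * 1024 * 1024}`);
await cmd(`ALTER TABLE hl_also2sm ENGINE=MEMORY`);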

J. McNerney
  • Please provide `EXPLAIN SELECT ...`. It is not wise to remove the PK. But is the pair `(item,also)` unique? Do you need the uniqueness constraint; you have removed it. – Rick James Jul 12 '19 at 17:25
  • The EXPLAINs are in the question now. The PK uniqueness is not required in this table. This table is updated once a day, and I can manage the uniqueness during the update process. I'd rather have fast reads. – J. McNerney Jul 12 '19 at 18:42
  • Some reads are _faster_ with a suitable `PRIMARY KEY`, some are slower. Let's see the important `SELECTs`. – Rick James Jul 13 '19 at 18:40
  • @RickJames the table information is in the 1st code block, and the critical select "get top 50 correlated items" is in the 2nd. Speeds may be reasonable, given the length of the "IN()" arrays, but I'm no MySQL expert. – J. McNerney Jul 15 '19 at 13:40