
I am trying to speed up the SELECT query below, where I have over 1000 items in the WHERE ... IN list.

table:

CREATE TABLE `user_item` (
  `user_id` int(11) unsigned NOT NULL,
  `item_id` int(11) unsigned NOT NULL,
  PRIMARY KEY (`user_id`,`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

query:

SELECT
    item_id
FROM
    user_item
WHERE
    user_id = 2
    AND item_id IN(3433456,67584634,587345,...)

With 1000 items in the IN list, the query takes about 3 seconds to execute. Is there any optimization that can be done in this case? There can be billions of rows in this table. Is there an alternative way to do this faster, be it with another DB or a programming method?

UPDATE:

Here are the results of EXPLAIN:

If I have 999 items in the IN(...) statement:

+------+-------------+-----------+-------+---------------+---------+---------+------+------+--------------------------+
| id   | select_type | table     | type  | possible_keys | key     | key_len | ref  | rows | Extra                    |
+------+-------------+-----------+-------+---------------+---------+---------+------+------+--------------------------+
|    1 | SIMPLE      | user_item | range | PRIMARY       | PRIMARY | 8       | NULL |  999 | Using where; Using index |
+------+-------------+-----------+-------+---------------+---------+---------+------+------+--------------------------+

If I have 1000 items in the IN(...) statement:

+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
| id   | select_type  | table       | type   | possible_keys | key     | key_len | ref                | rows | Extra                    |
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
|    1 | PRIMARY      | <subquery2> | ALL    | distinct_key  | NULL    | NULL    | NULL               | 1000 |                          |
|    1 | PRIMARY      | user_item   | eq_ref | PRIMARY       | PRIMARY | 8       | const,tvc_0._col_1 |    1 | Using where; Using index |
|    2 | MATERIALIZED | <derived3>  | ALL    | NULL          | NULL    | NULL    | NULL               | 1000 |                          |
|    3 | DERIVED      | NULL        | NULL   | NULL          | NULL    | NULL    | NULL               | NULL | No tables used           |
+------+--------------+-------------+--------+---------------+---------+---------+--------------------+------+--------------------------+
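
The plan change at exactly 1000 items appears to be MariaDB converting the long IN list into a materialized table value constructor (the `<subquery2>` and `tvc_0` rows above). Assuming a MariaDB version where it is exposed (10.3.18+), the cutoff is the `in_predicate_conversion_threshold` session variable (default 1000), so the two plans can be compared directly:

-- 0 disables the conversion; re-running EXPLAIN on the 1000-item query
-- should then show the same range plan as the 999-item version
SET SESSION in_predicate_conversion_threshold = 0;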

Update 2

I want to explain why I need to do the above:

I want to give the user the ability to list items ordered by sort_criteria_1, sort_criteria_2 or sort_criteria_3, and to exclude from the list those items that have been marked by the given (n) users in the user_item table.

Here's a sample schema:

CREATE TABLE `user` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(45) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `item` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `file` varchar(45) NOT NULL,
  `sort_criteria_1` int(11) DEFAULT NULL,
  `sort_criteria_2` int(11) DEFAULT NULL,
  `sort_criteria_3` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_sc1` (`sort_criteria_1`),
  KEY `idx_sc2` (`sort_criteria_2`),
  KEY `idx_sc3` (`sort_criteria_3`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `user_item` (
  `user_id` int(11) NOT NULL,
  `item_id` int(11) NOT NULL,
  PRIMARY KEY (`user_id`,`item_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Here's how I would get items ordered by sort_criteria_2, excluding ones that have a record for users (300, 6, 1344, 24) in the user_item table:

SELECT
    i.id
FROM
    item i
    LEFT JOIN user_item ui1 ON (i.id = ui1.item_id AND ui1.user_id = 300)
    LEFT JOIN user_item ui2 ON (i.id = ui2.item_id AND ui2.user_id = 6)
    LEFT JOIN user_item ui3 ON (i.id = ui3.item_id AND ui3.user_id = 1344)
    LEFT JOIN user_item ui4 ON (i.id = ui4.item_id AND ui4.user_id = 24)
WHERE
    ui1.item_id IS NULL
    AND ui2.item_id IS NULL
    AND ui3.item_id IS NULL
    AND ui4.item_id IS NULL
ORDER BY
    i.sort_criteria_2
LIMIT
    800

The main problem with the above approach is that the more users I'm filtering by, the more expensive the query gets. I want the toll for the filtering to be paid by the client browser, so I would send the list of items plus a list of matching user_item records per user to the client to filter by. This would help with sharding as well, since I would not have to keep the user_item tables, or the relevant set of records, on the same machine.
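
For comparison, the same exclusion can be written as a single NOT EXISTS anti-join instead of one LEFT JOIN per user; a sketch of an equivalent query against the schema above (the user list still grows with (n), but the query shape stays fixed):

SELECT
    i.id
FROM
    item i
WHERE
    NOT EXISTS (
        SELECT 1
        FROM user_item ui
        WHERE ui.item_id = i.id
          AND ui.user_id IN (300, 6, 1344, 24)
    )
ORDER BY
    i.sort_criteria_2
LIMIT
    800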

Jimbotron
  • Your table does not have a column `parent_id`, how have you created the table? Also, does `parent_id` have an index? – Progman Jun 14 '20 at 18:18
  • Also additional theoretical question. Is `item_id` important for your query, what is the percentage of items it filters out? Have you tried getting all items by `user_id` and filtering them out programmatically? – stepio Jun 14 '20 at 18:21
  • @Progman fixed the query. Please see above. – Jimbotron Jun 14 '20 at 18:22
  • @stepio Please see the updated query. I changed names when posting to stackoverflow and forgot to update query portion. See update above. – Jimbotron Jun 14 '20 at 18:23
  • @Jimbotron, you still use `parent_id` in the key. Also updated my question above - it still makes sense. – stepio Jun 14 '20 at 18:25
  • @stepio updated. – Jimbotron Jun 14 '20 at 18:27
  • @Jimbotron Please [edit] your question to include the result of `EXPLAIN SELECT....` to your question. – Progman Jun 14 '20 at 18:58
  • @Progman explain results added for 999 items as well as for 1000 items, because they are different. – Jimbotron Jun 14 '20 at 19:56
  • how many different item ids are there? 1000 is a lot of item ids; how are they chosen for the query? – Bohemian Jun 14 '20 at 23:25
  • @Bohemian assume there are 5 billion rows in the table and the average number of items to check for is 800-1000 – Jimbotron Jun 14 '20 at 23:31
  • @Jimbotron OK, but you haven't answered *either* of my questions: 1) how many **different** item ids are there, and 2) How are the items ids chosen for the query? – Bohemian Jun 15 '20 at 01:13
  • @Bohemian there are about 100 million different item ids. There is a separate table of "lists" that contains these ids. Joins are not an option. – Jimbotron Jun 15 '20 at 02:34
  • What version of MariaDB? – Rick James Jun 15 '20 at 04:01
  • Are there other columns in the table? What is the value of `innodb_buffer_pool_size`? – Rick James Jun 15 '20 at 04:07
  • If you run the same 999-item query twice in a row, how fast does the second one run? And for the 1000-item query? – Rick James Jun 15 '20 at 04:08
  • @Jim whenever I hear "such and such is not an option", I ignore the statement unless it is justified. I usually go ahead ignoring the statement regardless, because in this universe very few things are truly "not an option". So why, exactly, are "joins not an option"? Could you show us some schema of where the item id lists are stored? – Bohemian Jun 15 '20 at 08:59
  • @Bohemian fair point. I respect your expertise and would not argue that point. In an attempt to find the most efficient way to filter lists, I decided to investigate whether filtering arrays in the client's browser would save resources compared to running joins on the server. In my test, JavaScript did an excellent job comparing arrays, so I thought I had a solution, until I realised that WHERE ... IN with many items is not as fast as I'd like it to be. I believe this belongs in another question; is it ok if I post it as another question and place a link here for you to look at? – Jimbotron Jun 15 '20 at 15:26
  • @Bohemian https://softwareengineering.stackexchange.com/questions/411514/ here is the explanation of why I've gone down that rabbit hole. Hope it will make sense. – Jimbotron Jun 15 '20 at 18:14
  • How about just answering my question about where the item id lists come from, showing the schema too. Just for context, my hunch is a join would be very efficient, perhaps returning the first row in milliseconds, but I need to see the schema. Can you provide a join query that works, albeit slowly? – Bohemian Jun 15 '20 at 18:41
  • @Bohemian I have added a second update; it includes the schema and an explanation of why I would like to avoid joins, beyond performance. – Jimbotron Jun 15 '20 at 19:06
  • Does this answer your question? [Favourite performance tuning tricks](https://stackoverflow.com/questions/18783/favourite-performance-tuning-tricks) – philipxy Jun 15 '20 at 21:51

2 Answers


It's hard to tell exactly, but there could be lag in parsing your huge query because of the many constant item_id values.

  1. Have you tried getting all the values by user_id alone? As this column is the first (leading) one in the PRIMARY KEY, the relevant index would still be used.

  2. Have you tried replacing the constant list with a subquery? Maybe you're interested in items of a specific type, for example.

  3. Make sure that you use prepared statements, at least if your database and language support them. This would protect your code from possible SQL injection and enable the database's built-in query caching (if your database supports it); see the sketch below.
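
A minimal sketch of a server-side prepared statement, assuming MariaDB/MySQL syntax and the ids from the question (client libraries expose the same concept through their own placeholder APIs):

-- prepare once, then re-execute with different bound values
PREPARE stmt FROM 'SELECT item_id FROM user_item WHERE user_id = ? AND item_id IN (?, ?, ?)';
SET @uid = 2, @a = 3433456, @b = 67584634, @c = 587345;
EXECUTE stmt USING @uid, @a, @b, @c;
DEALLOCATE PREPARE stmt;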

stepio
  • 1. I don't need all the values for the user, I need specific ones based on item_id. 2. A subquery would not work; I'm provided with ids and I need to do a lookup. 3. This is running directly in the MariaDB CLI; at this stage only performance is a concern. – Jimbotron Jun 14 '20 at 18:38
  • No clues then - the task is so simple that there are many ways to complicate it, but I have no idea how to optimize it further within the current constraints. – stepio Jun 14 '20 at 22:23
  • I've seen 70K items take a noticeable amount of time -- maybe 2 seconds. Something is strange about 3 seconds for a mere 1K. – Rick James Jun 15 '20 at 04:10

Instead of putting the 1000 item_ids into the IN clause, you could put them into a temporary table with an index and join it with the user_item table.

If you also have an index with both user_id and item_id, that would make the query as fast as it gets. The rest depends on the data distribution.
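
A minimal sketch of that approach, assuming the schema from the question (the temporary table name is illustrative):

CREATE TEMPORARY TABLE tmp_item_ids (
  item_id int(11) unsigned NOT NULL,
  PRIMARY KEY (item_id)
) ENGINE=InnoDB;

-- ideally one multi-row INSERT for the whole id list
INSERT INTO tmp_item_ids (item_id) VALUES (3433456), (67584634), (587345);

SELECT ui.item_id
FROM user_item ui
JOIN tmp_item_ids t ON t.item_id = ui.item_id
WHERE ui.user_id = 2;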

slaakso
  • Nah, that's not even remotely feasible in my use case. There could be thousands of queries per second, and temp tables and joins would not be faster. – Jimbotron Jun 14 '20 at 23:13
  • It looks like MariaDB is doing what you suggest in the 1000 case. – Rick James Jun 15 '20 at 04:07