113

I have been experimenting with Redis and MongoDB lately, and it seems there are often cases where you would store an array of IDs in either one. I'll stick with Redis for this question, since I am asking about the MySQL IN operator.

I was wondering how performant it is to list a large number (300-3000) of id's inside the IN operator, which would look something like this:

SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)

Imagine something as simple as products and categories tables, which you might normally JOIN to get the products for a certain category. In the example above you can see that, for a given category in Redis ( category:4:product_ids ), I fetch all the product IDs for the category with ID 4 and place them inside the IN operator of the above SELECT query.

How performant is this?

Is this an "it depends" situation? Or is there a concrete "this is (un)acceptable", "fast", or "slow"? Should I add a LIMIT 25, or doesn't that help?

SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 3000)
LIMIT 25

Or should I trim the array of product IDs returned by Redis to 25, so that only 25 IDs go into the query instead of 3000, rather than LIMIT-ing it to 25 from inside the query?

SELECT id, name, price
FROM products
WHERE id IN (1, 2, 3, 4, ...... 25)

Any suggestions/feedback is much appreciated!

Jonathan Leffler
Michael van Rooijen
  • I'm not sure exactly what you are asking. One query with "id IN (1,2,3, ...3000)" is faster than 3000 queries with "id = value". But a join with "category = 4" will be faster than both of the above. – Ronnis Dec 23 '10 at 12:51
  • Right, though since a product can belong to multiple categories, you cannot do the "category = 4". Using Redis I would store all the IDs of the products that belong to a certain category and then query on that. I guess the real question is: how would `id IN (1,2,3 ... 3000)` perform compared to a JOIN on a `products_categories` table? Or is that what you were saying? – Michael van Rooijen Dec 23 '10 at 20:57
  • Just be careful of that bug in MySQL: http://stackoverflow.com/questions/3417074/why-would-an-in-condition-be-slower-than-in-sql – Itay Moav -Malimovka May 07 '11 at 05:12
  • Of course there is no reason why this shouldn't be as efficient as any other method of retrieving indexed rows; it just depends on whether database authors have tested and optimized for it. In terms of computational complexity we're going to do at worst an O(n log N) sort on the `IN` clause (this might even be linear on a sorted list like you show, depending on the algorithm), and then linear intersection/lookups. – jberryman Apr 30 '17 at 18:58

6 Answers

47

Generally speaking, if the IN list gets too large (for some ill-defined value of 'too large' that is usually in the region of 100 or smaller), it becomes more efficient to use a join, creating a temporary table if need be to hold the numbers.

If the numbers are a dense set (no gaps - which the sample data suggests), then you can do even better with WHERE id BETWEEN 300 AND 3000.

However, presumably there are gaps in the set, at which point it may be better to go with the list of valid values after all. If the gaps are relatively few in number, you could instead use:

WHERE id BETWEEN 300 AND 3000 AND id NOT BETWEEN 742 AND 836

or whatever the gaps are.
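As a sketch of the join-plus-temporary-table approach (the `wanted_ids` name and the MySQL syntax here are illustrative, not from the original answer):

```sql
-- Stage the IDs from Redis in an indexed temporary table,
-- then join instead of using a long literal IN list.
CREATE TEMPORARY TABLE wanted_ids (id INT NOT NULL PRIMARY KEY);
INSERT INTO wanted_ids (id) VALUES (1), (2), (3); -- ... up to 3000 rows

SELECT p.id, p.name, p.price
FROM products AS p
JOIN wanted_ids AS w ON w.id = p.id;
```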

Giacomo1968
Jonathan Leffler
  • 62
    Can you please give an example of "use a join, creating a temporary table"? – Jake Oct 13 '16 at 07:46
  • if the data set came from an interface (a multi-select element) and there are gaps in the selected data, and the gaps are not sequential (missing: 457, 490, 658, ...), then `AND id NOT BETWEEN XXX AND XXX` will not work, and it's better to stick with the equivalent `(x = 1 OR x = 2 OR x = 3 ... OR x = 99)` as @David Fells wrote. – devasia2112 Apr 28 '19 at 11:36
  • 2
    in my experience - working on e-commerce websites, where we have to show search results of ~50 unrelated product IDs - we had better results with "1. 50 separate queries" than with "2. one query with many values in the IN clause". I don't have any way to prove it at the moment, except that query #2 always shows up as a slow query in our monitoring systems, whereas #1 never does, even though the number of executions is in the millions... does anyone have the same experience? (we can maybe attribute it to better caching, or to allowing other queries to interleave between queries...) – Chaim Klar Nov 06 '19 at 16:00
  • 4
    @Chaim, of course the separate query isn't slow. Each one only has to fetch one record. The profiler doesn't know that a set of queries are related and need to be aggregated for comparison. – Daniel Nov 20 '20 at 19:38
35

I have been doing some tests, and as David Fells says in his answer, it is quite well optimized. As a reference, I created an InnoDB table with 1,000,000 rows; a SELECT using the "IN" operator with 500,000 random numbers takes only 2.5 seconds on my Mac, while selecting only the rows with odd IDs takes 0.5 seconds.

The only problem I had is that I had to increase the max_allowed_packet parameter in the my.cnf file. If not, a mysterious "MySQL server has gone away" error is generated.

Here is the PHP code that I use to make the test:

$NROWS = 1000000;   // total rows to insert
$SELECTED = 50;     // each row has a ~50% chance of landing in the IN list
$BATCH = 100;       // rows per INSERT statement

$dsn = "mysql:host=localhost;port=8889;dbname=testschema";
$pdo = new PDO($dsn, "root", "root");
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$pdo->exec("DROP TABLE IF EXISTS `testtable`");
$pdo->exec("CREATE TABLE `testtable` (
        `id` INT NOT NULL,
        `text` VARCHAR(45) NULL,
        PRIMARY KEY (`id`))");

$before = microtime(true);

$Values = '';
$SelValues = '(';
$c = 0;
for ($i = 0; $i < $NROWS; $i++) {
    $r = rand(0, 99);
    if ($c > 0) $Values .= ",";
    $Values .= "( $i , 'This is value $i and r= $r')";
    if ($r < $SELECTED) {
        if ($SelValues != "(") $SelValues .= ",";
        $SelValues .= $i;
    }
    $c++;

    // Flush the accumulated rows in batches of $BATCH.
    if (($c == $BATCH) || ($i == $NROWS - 1)) {
        $pdo->exec("INSERT INTO `testtable` VALUES $Values");
        $Values = "";
        $c = 0;
    }
}
$SelValues .= ')';
echo "<br>";

$after = microtime(true);
echo "Insert execution time = " . ($after - $before) . "s<br>";

$before = microtime(true);
$sql = "SELECT count(*) FROM `testtable` WHERE id IN $SelValues";
$result = $pdo->prepare($sql);
$after = microtime(true);
echo "Prepare execution time = " . ($after - $before) . "s<br>";

$before = microtime(true);
$result->execute();
$c = $result->fetchColumn();
$after = microtime(true);
echo "Random selection = $c Execution time = " . ($after - $before) . "s<br>";

$before = microtime(true);
$sql = "SELECT count(*) FROM `testtable` WHERE id % 2 = 1";
$result = $pdo->prepare($sql);
$result->execute();
$c = $result->fetchColumn();
$after = microtime(true);
echo "Odd rows = $c Execution time = " . ($after - $before) . "s<br>";

And the results:

Insert execution time = 35.2927210331s
Prepare execution time = 0.0161771774292s
Random selection = 499102 Execution time = 2.40285992622s
Odd rows = 500000 Execution time = 0.465420007706s
Giacomo1968
jbaylina
  • For the sake of others, I'll add that running in VirtualBox (CentOS) on my Late 2013 MBP with an i7, the third line of the output (the one relevant to the question) was: *Random selection = 500744 Time execution time =53.458173036575s*. 53 seconds might be tolerable depending on your application; for my uses, not really. Also, note that the modulo test isn't relevant for the question at hand, since it uses the modulo operator (`%`) with an equality comparison (`=`) instead of `IN()`. – rinogo May 01 '15 at 22:40
  • It's relevant because it is a way to compare a query using the IN operator with a similar query without it. Maybe the higher time you got is download time, or because your machine is swapping or running another virtual machine. – jbaylina May 02 '15 at 20:38
17

You can create a temporary table to hold any number of IDs and run a nested query. Example:

CREATE [TEMPORARY] TABLE tmp_IDs (`ID` INT NOT NULL,PRIMARY KEY (`ID`));

and select:

SELECT id, name, price
FROM products
WHERE id IN (SELECT ID FROM tmp_IDs);
Vladimir Jotov
4

Using IN with a large parameter set on a large list of records will in fact be slow.

In the case that I solved recently, I had two WHERE clauses, one with 2,500 parameters and the other with 3,500, querying a table of 40 million records.

My query took 5 minutes using the standard WHERE IN list. By instead using a subquery for the IN statement (putting the parameters in their own indexed table), I got the query down to two seconds.

Worked for both MySQL and Oracle in my experience.
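A sketch of what that might look like in MySQL (the `param_ids` and `big_table` names are hypothetical, and the bulk load of parameters is elided):

```sql
-- Stage the parameters in their own indexed table once...
CREATE TABLE param_ids (id INT NOT NULL PRIMARY KEY);
-- ... bulk-insert the parameter values here ...

-- ...then let IN run against a subquery instead of a literal list.
SELECT *
FROM big_table
WHERE id IN (SELECT id FROM param_ids);
```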

Giacomo1968
yoyodunno
  • 8
    I didn't get your point at "By instead using a subquery for the IN statement (putting the parameters in their own indexed table)". Did you mean that instead of using "WHERE ID IN(1,2,3)" we should use "WHERE ID IN(SELECT id FROM xxx)"? – Istiyak Tailor Aug 20 '20 at 05:44
  • Agree with istiyak as your statement is not clear – Manish Gupta Jun 23 '21 at 07:33
  • @ManishGupta Sorry for not being clear, but yes, I think that's what I meant: placing all the values into an indexed table, and adding that SELECT statement as a subquery to the IN statement. Hard to remember, since this was years ago. – yoyodunno Jun 23 '21 at 17:01
3

IN is fine, and well optimized. Make sure you use it on an indexed field and you're fine.

It's functionally equivalent to:

(x = 1 OR x = 2 OR x = 3 ... OR x = 99)

As far as the DB engine is concerned.

EDIT: Please note this answer was written in 2011; see the comments on this answer discussing later MySQL features.

ch271828n
David Fells
  • 1
    Not really. I use an IN clause to fetch 5k records from the DB. The IN clause contains a list of PKs, so the related column is indexed and guaranteed to be unique. EXPLAIN says that a full table scan is performed instead of using a PK lookup in "fifo-queue-alike" style. – Antoniossss May 05 '16 at 10:47
  • On MySQL I don't believe they are _"functionally equivalent"_. `IN` uses optimizations for better performance. – Joshua Pinter Aug 21 '19 at 18:58
  • 1
    Josh, the answer was from 2011 - I'm sure things have changed since then, but back in the day IN was flat out converted to be a series of OR statements. – David Fells Aug 22 '19 at 19:06
  • 7
    This answer is not correct. From *High Performance MySQL*: Not so in MySQL, which sorts the values in the IN( ) list and uses a fast binary search to see whether a value is in the list. This is O(log n) in the size of the list, whereas an equivalent series of OR clauses is O(n) in the size of the list (i.e., much slower for large lists). – Bert Sep 18 '19 at 19:26
  • 1
    Bert - yes. This answer is obsolete. Feel free to suggest an edit. – David Fells Sep 20 '19 at 14:35
-3

When you provide many values to the IN operator, it first must sort them to remove duplicates. At least I suspect that. So it would not be good to provide too many values, as sorting takes N log N time.

My experience showed that slicing the set of values into smaller subsets and combining the results of all the queries in the application gives the best performance. I admit that I gathered that experience on a different database (Pervasive), but the same may apply to all engines. My count of values per subset was 500-1000; more or fewer than that was significantly slower.
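A minimal sketch of that slicing approach in PHP (the helper name and the 500-per-chunk size are illustrative; executing the queries and merging the result sets is left as a comment):

```php
<?php
// Hypothetical helper: split a large ID list into chunks and build
// one SELECT per chunk; the application merges the result sets.
function build_chunked_queries(array $ids, int $chunkSize): array {
    $queries = [];
    foreach (array_chunk($ids, $chunkSize) as $chunk) {
        $queries[] = "SELECT id, name, price FROM products WHERE id IN ("
                   . implode(',', $chunk) . ")";
    }
    return $queries;
}

// Usage: run each query and append its rows to one combined result, e.g.
// foreach (build_chunked_queries($productIds, 500) as $sql) { ... }
```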

Jarekczek
  • 1
    I know this is 7 years on, but the problem with this answer is simply that it’s a comment based on an educated guess. – Giacomo1968 Feb 09 '19 at 02:21