
If I compare

explain select * from Foo where find_in_set(id,'2,3');
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+
|  1 | SIMPLE      | Foo   | ALL  | NULL          | NULL | NULL    | NULL |    4 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+------+-------------+

with this one

explain select * from Foo where id in (2,3);
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | Foo   | range | PRIMARY       | PRIMARY | 8       | NULL |    2 | Using where |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+

It is apparent that FIND_IN_SET does not exploit the primary key.

I want to put a query such as the above into a stored procedure, with the comma-separated string as an argument.

Is there any way to make the query behave like the second version, in which the index is used, but without knowing the content of the id set at the time the query is written?

spraff
  • no. it can't. indexes almost always can NOT be used when you're matching against derived values. and generally speaking, storing multiple values in a single field is almost always a sign of a bad design. ESPECIALLY when you need to extract/use individual components of those multiple values. you should normalize your table, and then this problem goes away. – Marc B Jun 23 '16 at 14:48
  • Part of the problem is that find_in_set is a string operator. If the ids you are querying against are not stored as character / varchar in the table, the query does an implicit conversion on the table column. The conversion function disables the use of the index, so you are left with a full table scan. – T Gray Jun 23 '16 at 15:11
  • @MarcB the database is normalized, the CSV string comes from the UI. "Get me data for the following people: 101,202,303" – spraff Jun 23 '16 at 15:34

3 Answers


In reference to your comment:

@MarcB the database is normalized, the CSV string comes from the UI. "Get me data for the following people: 101,202,303"

This answer focuses narrowly on those numbers separated by commas, because, as it turns out, you were not really asking about FIND_IN_SET after all.

Yes, you can achieve what you want. Create a prepared statement that accepts a string as a parameter, as in this recent answer of mine. In that answer, look at the second block, which shows the CREATE PROCEDURE and its 2nd parameter accepting a string like (1,2,3). I will come back to this point in a moment.
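A minimal sketch of such a procedure follows. The table and procedure names are illustrative (not from the linked answer), and the list is assumed to arrive already wrapped in parentheses:

```sql
-- Hypothetical sketch: splice a pre-validated id list such as '(2,3)'
-- into the statement text via a prepared statement, so the optimizer
-- sees a literal IN list and can use PRIMARY KEY range access.
DELIMITER $$
CREATE PROCEDURE getFooByIds(IN idList VARCHAR(4000))
BEGIN
    SET @sql = CONCAT('SELECT * FROM Foo WHERE id IN ', idList);
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END$$
DELIMITER ;

-- Usage:
CALL getFooByIds('(2,3)');
```

Because the list is concatenated into the SQL text rather than bound as a single string parameter, EXPLAIN treats it exactly like the hand-written `id in (2,3)` version.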

You may not need to see it, @spraff, but others might. The mission is to get EXPLAIN to show a type other than ALL, and non-NULL possible_keys and key columns, as in your second block. For general reading on the topic, see the article Understanding EXPLAIN’s Output and the MySQL Manual page entitled EXPLAIN Extra Information.

Now, back to the (1,2,3) reference above. We know from your comment, and your second Explain output in your question that it hits the following desired conditions:

  1. type = range (and, in particular, not ALL). See the docs above on this.
  2. key is not null

These are precisely the conditions you have in your second Explain output, and the output that can be seen with the following query:

explain 
select * from ratings where id in (2331425, 430364, 4557546, 2696638, 4510549, 362832, 2382514, 1424071, 4672814, 291859, 1540849, 2128670, 1320803, 218006, 1827619, 3784075, 4037520, 4135373, ... use your imagination ..., ...,  4369522, 3312835);

where I have 999 values in that IN clause. That is a sample from this answer of mine, in Appendix D, which generates such a random CSV string, surrounded by open and close parentheses.

And note the following EXPLAIN output for that 999-element IN clause:

(Screenshot of the EXPLAIN output: type = range, key = PRIMARY, confirming the index is used for the 999-element IN list.)

Objective achieved. You do this with a stored procedure similar to the one mentioned in the link above, using a PREPARED STATEMENT (built with CONCAT() and run with EXECUTE).

The index is used, and a table scan (the bad outcome) is avoided. Further reading: The range Join Type, any reference you can find on MySQL's cost-based optimizer (CBO), and this answer from vladr (though dated), with an eye on the ANALYZE TABLE part, particularly after significant data changes. Note that ANALYZE can take a significant amount of time to run on very large datasets, sometimes many hours.

SQL Injection Attacks:

Strings passed to stored procedures are an attack vector for SQL injection. Precautions must be in place to prevent it when using user-supplied data. If your routine is applied only against ids generated by your own system, then you are safe. Note, however, that second-order SQL injection occurs when data was stored by routines that did not sanitize it in a prior INSERT or UPDATE: the attack is planted in the data and triggered later (a sort of time bomb).
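A procedure that splices a user-supplied list into dynamic SQL can guard itself by rejecting anything that is not strictly digits and commas before concatenating. This is a hedged sketch; the procedure name and error message are illustrative:

```sql
-- Hypothetical guard: validate the list before building dynamic SQL.
-- Accepts only strings shaped like '(1,2,3)'; anything else raises an error.
DELIMITER $$
CREATE PROCEDURE getFooByIdsSafe(IN idList VARCHAR(4000))
BEGIN
    IF idList NOT REGEXP '^\\([0-9]+(,[0-9]+)*\\)$' THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'idList must look like (1,2,3)';
    END IF;
    SET @sql = CONCAT('SELECT * FROM Foo WHERE id IN ', idList);
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END$$
DELIMITER ;
```

Since the pattern admits only digits, commas, and the enclosing parentheses, no quote or keyword can survive into the concatenated statement.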

This answer is essentially finished at this point.

Below is a view of the same table, with a minor modification, to show what the dreaded table scan looks like when the prior query runs against a non-indexed column called thing.

Take a look at our current table definition:

CREATE TABLE `ratings` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `thing` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;

select min(id), max(id),count(*) as theCount from ratings;
+---------+---------+----------+
| min(id) | max(id) | theCount |
+---------+---------+----------+
|       1 | 5046213 |  4718592 |
+---------+---------+----------+

Note that the column thing was a nullable int column before.

update ratings set thing=id where id<1000000;
update ratings set thing=id where id>=1000000 and id<2000000;
update ratings set thing=id where id>=2000000 and id<3000000;
update ratings set thing=id where id>=3000000 and id<4000000;
update ratings set thing=id where id>=4000000 and id<5100000;
select count(*) from ratings where thing!=id;
-- 0 rows

ALTER TABLE ratings MODIFY COLUMN thing int not null;

-- current table definition (after above ALTER):
CREATE TABLE `ratings` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `thing` int(11) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5046214 DEFAULT CHARSET=utf8;

And here is the EXPLAIN showing a table scan (against column thing):

(Screenshot of the EXPLAIN output: type = ALL, key = NULL — a full table scan against the non-indexed column thing.)

Drew
  • Thanks for the detailed answer, although if it comes down to this then I may as well do it in application-layer prepared statements. I was hoping to avoid injection attacks (if `SELECT ... WHERE x in $list` could be done such that `$list` could *only* be interpreted as a set of data, not as an arbitrary string, then that would have been ideal). – spraff Jun 24 '16 at 12:33
  • Yes, my example was a deliberate one from pre-saved data for 1000 random questions. Pre-computed and housed and ready to be marked as 'used' before. Would cut down the presentation by a factor of at least 100x. But it shows that the index is used is the point. I will try to improve the bottom of it warning against SQL Injection. – Drew Jun 24 '16 at 12:35
  • Parse $list, make sure each component is just digits, then construct the list for `IN(...)`. In this particular usage, checking $list for the regular expression `"^[ ,0-9]+$"` should suffice. – Rick James Sep 03 '16 at 04:13
  • @RickJames do you or any peers review the above? – Drew Sep 03 '16 at 04:58
  • "peer review of my suggestion"? A few dozen people have "viewed" this thread, a few may have read my suggestion. It's a money-back guarantee; the problem is you paid $0 for the suggestion. – Rick James Sep 03 '16 at 05:31

You can use the following technique to make the query use the primary index.

Prerequisites:

  1. You know the maximum number of items in the comma-separated string, and it is not large

Description:

  1. We convert the comma-separated string into a derived table of sequence numbers
  2. We inner join to that derived table, matching each extracted element against the indexed id
-- set the CSV string and the maximum number of elements it may contain
SELECT @ids := '1,2,3,5,11,4', @maxCnt := 15;

SELECT *
FROM foo
INNER JOIN (
    -- derived table of sequence numbers 1..@maxCnt
    SELECT * FROM (SELECT @n := @n + 1 AS n FROM foo INNER JOIN (SELECT @n := 0) AS _init) AS _seq WHERE _seq.n <= @maxCnt
) AS k
    -- limit n to the actual number of elements in @ids
    ON k.n <= LENGTH(@ids) - LENGTH(REPLACE(@ids, ',', '')) + 1
    -- extract the nth element and match it against the primary key
    AND id = SUBSTRING_INDEX(SUBSTRING_INDEX(@ids, ',', k.n), ',', -1);

This is a trick to extract the nth value in a comma-separated list:

SUBSTRING_INDEX(SUBSTRING_INDEX(@ids, ',', k.n), ',', -1)

Note: @ids can be anything, including a column from another table or from the same table.
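To see how the two nested SUBSTRING_INDEX calls combine, here is a quick illustration extracting the 3rd element of a sample list:

```sql
-- SUBSTRING_INDEX('1,2,3,5,11,4', ',', 3)  -> '1,2,3'  (first three elements)
-- SUBSTRING_INDEX('1,2,3', ',', -1)        -> '3'      (last element of that)
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX('1,2,3,5,11,4', ',', 3), ',', -1) AS third;
-- returns '3'
```

The positive count keeps everything up to the nth delimiter; the negative count then peels off just the final piece.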


In your question, what would be indexed is the id, not the SET.

If you were looking to index a SET column, that is currently (v8.1) not supported.

However, as of v8.0.20 you CAN create a JSON array instead of a SET, and add an index to the JSON Array column.

To convert an existing set:

update <table> set <json_arr> = concat('["',replace(<set_col>, ',', '","'),'"]');

Then add an index on the new column:

alter table <table> add index <json_ind> ((cast(<json_arr> AS char(32) ARRAY)));

You then query using JSON functions, e.g.:

  • WHERE JSON_CONTAINS(<json_arr>, '"val"')
  • WHERE JSON_OVERLAPS(<json_arr>, CAST('["val1","val2"]' AS JSON))
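Putting the pieces together, an end-to-end sketch on a fresh table might look like this. The table and index names are illustrative, and a recent MySQL 8.0 release with multi-valued index support is assumed:

```sql
-- Hypothetical example: a JSON array column with a multi-valued index.
CREATE TABLE tags_demo (
    id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    tags JSON,
    -- multi-valued index over the array elements
    INDEX tags_idx ((CAST(tags AS CHAR(32) ARRAY)))
);

INSERT INTO tags_demo (tags) VALUES ('["red","blue"]'), ('["green"]');

-- Both predicates are candidates for the multi-valued index:
SELECT * FROM tags_demo WHERE JSON_CONTAINS(tags, '"red"');
SELECT * FROM tags_demo WHERE JSON_OVERLAPS(tags, CAST('["red","green"]' AS JSON));
```

Running EXPLAIN on these queries should show tags_idx in the key column rather than a full table scan, which is the whole point of converting the SET to a JSON array.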
SamGoody