The case sensitivity is determined by the collation on the column. Unless you have a specific reason for using a case sensitive collation, I suggest changing it to case insensitive to do away with the need to explicitly force the case. Instead of applying another function to every word you can just set the match type to case-insensitive for REGEXP_INSTR. You can also move the prefix check into the regexp -
SELECT word
FROM words
WHERE REGEXP_INSTR(word, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
Of course, the above query cannot use any available index to filter so moving the prefix to the regex is not such a great idea. This lead me to do some tests. I took a simplified copy of my dictionary to create the following table (111,745 rows) -
CREATE TABLE `words` (
`word_cs` varchar(25) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs NOT NULL,
`word_ci` varchar(25) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NOT NULL,
KEY `idx_word_ci` (`word_ci`),
KEY `idx_word_cs` (`word_cs`)
) ENGINE=InnoDB;
The first batch of queries I ran against the case-sensitive word_cs
and I ran each query five times and the timings are the averages -
SELECT word_cs
FROM words
WHERE REGEXP_INSTR(word_cs, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.166s */
SELECT word_cs
FROM words
WHERE REGEXP_INSTR(lower(word_cs), '(?=^ab)(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.193s */
SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND REGEXP_INSTR(word_cs, '(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.067s */
SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND REGEXP_INSTR(lower(word_cs), '(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.065s */
SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND INSTR(lower(word_cs), 'o' ) > 0
AND INSTR(lower(word_cs), 'd' ) > 0;
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.064s */
SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND lower(word_cs) LIKE '%o%'
AND lower(word_cs) LIKE '%d%';
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.063s */
I then ran a similar (slightly modified due to lack of case sensitivity) batch of queries against the case-insensitive word_ci
-
SELECT word_ci
FROM words
WHERE REGEXP_INSTR(word_ci, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.147s */
SELECT word_ci
FROM words
WHERE REGEXP_INSTR(word_ci, '(?=^ab)(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.157s */
SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND REGEXP_INSTR(word_ci, '(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.003s */
SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND REGEXP_INSTR(word_ci, '(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.003s */
SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND INSTR(word_ci, 'o' ) > 0
AND INSTR(word_ci, 'd' ) > 0;
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.001s */
SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND word_ci LIKE '%o%'
AND word_ci LIKE '%d%';
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.001s */
|
word_cs |
word_ci |
Query 1 |
0.166 |
0.147 |
Query 2 |
0.193 |
0.157 |
Query 3 |
0.067 |
0.003 |
Query 4 |
0.065 |
0.003 |
Query 5 |
0.064 |
0.001 |
Query 6 |
0.063 |
0.001 |
The real difference in performance is not down to the case-sensitivity of the collation, or lack thereof, but whether the query is sargable or not. The use of LOWER() on the column value negates index use.
/* Full table scan */
SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
/* Uses index if available */
SELECT word_cs
FROM words
WHERE word_cs LIKE 'ab%'
OR word_cs LIKE 'AB%'
OR word_cs LIKE 'Ab%'
OR word_cs LIKE 'aB%';