0

I'm trying to write a query that searches a list of words. One of the conditions is to check words that contain (in any order) a any of the characters in string.

For example the word must contain 'o' and 'd' in any order, so 'ABDOMEN' and 'ABOUND'are both correct.

My query is :

SELECT word
FROM words
WHERE lower(word) like 'ab%'                   /* Words starts with AB               */
AND   REGEXP_INSTR(lower(word), '[str]') = 0   /* does not contain any of r, s or t  */
AND   REGEXP_INSTR(lower(word), '[od]') > 0    /* must  contain both o and d         */

The problem is the 'must contain' condition, specifically getting it to check for both 'O' and 'D', whereas above it is more like 'O' or 'D'.

Experimenting, I found this works:

AND   REGEXP_INSTR(lower(word), '(o.*d|d.*o)' ) > 0    /* must  contain both o and d         */

The problem with this is I have to generate (in PHP) the (o.*d|d.*o) from the original od. When the list gets to over 3 chars generating this becomes arduous.

The other alternative is to add a separate condition for each character in the 'must contain' list:

AND   INSTR(lower(word), 'o' ) > 0    /* must  contain o          */
AND   INSTR(lower(word), 'd' ) > 0    /* must  contain d         */

However when using the bind_param call in PHP passing these in it makes the code messy.

Is there a 'one-liner' in MySQL that will achieve these above?

TenG
  • 3,843
  • 2
  • 25
  • 42

3 Answers3

1

The series of letters can be handled as in this example, and order of their appearance will be ignored:

WHERE REGEXP_INSTR(lower(word), '(?=.*O)(?=.*D)')

F001
  • 11
  • 1
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jan 23 '22 at 07:01
1

The case sensitivity is determined by the collation on the column. Unless you have a specific reason for using a case sensitive collation, I suggest changing it to case insensitive to do away with the need to explicitly force the case. Instead of applying another function to every word you can just set the match type to case-insensitive for REGEXP_INSTR. You can also move the prefix check into the regexp -

SELECT word
FROM words
WHERE REGEXP_INSTR(word, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');

Of course, the above query cannot use any available index to filter so moving the prefix to the regex is not such a great idea. This lead me to do some tests. I took a simplified copy of my dictionary to create the following table (111,745 rows) -

CREATE TABLE `words` (
  `word_cs` varchar(25) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_cs NOT NULL,
  `word_ci` varchar(25) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_as_ci NOT NULL,
  KEY `idx_word_ci` (`word_ci`),
  KEY `idx_word_cs` (`word_cs`)
) ENGINE=InnoDB;

The first batch of queries I ran against the case-sensitive word_cs and I ran each query five times and the timings are the averages -

SELECT word_cs
FROM words
WHERE REGEXP_INSTR(word_cs, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.166s */

SELECT word_cs
FROM words
WHERE REGEXP_INSTR(lower(word_cs), '(?=^ab)(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.193s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND REGEXP_INSTR(word_cs, '(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.067s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND REGEXP_INSTR(lower(word_cs), '(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.065s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND INSTR(lower(word_cs), 'o' ) > 0
AND INSTR(lower(word_cs), 'd' ) > 0;
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.064s */

SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'
AND lower(word_cs) LIKE '%o%'
AND lower(word_cs) LIKE '%d%';
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.063s */

I then ran a similar (slightly modified due to lack of case sensitivity) batch of queries against the case-insensitive word_ci -

SELECT word_ci
FROM words
WHERE REGEXP_INSTR(word_ci, '(?=^ab)(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.147s */

SELECT word_ci
FROM words
WHERE REGEXP_INSTR(word_ci, '(?=^ab)(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 111745 rows; Serverside execution time: 0.157s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND REGEXP_INSTR(word_ci, '(?=.*o)(?=.*d)', 1, 1, 0, 'i');
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.003s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND REGEXP_INSTR(word_ci, '(?=.*o)(?=.*d)');
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.003s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND INSTR(word_ci, 'o' ) > 0
AND INSTR(word_ci, 'd' ) > 0;
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.001s */

SELECT word_ci
FROM words
WHERE word_ci LIKE 'ab%'
AND word_ci LIKE '%o%'
AND word_ci LIKE '%d%';
/* Returned 58 rows; Examined 525 rows; Serverside execution time: 0.001s */
word_cs word_ci
Query 1 0.166 0.147
Query 2 0.193 0.157
Query 3 0.067 0.003
Query 4 0.065 0.003
Query 5 0.064 0.001
Query 6 0.063 0.001

The real difference in performance is not down to the case-sensitivity of the collation, or lack thereof, but whether the query is sargable or not. The use of LOWER() on the column value negates index use.

/* Full table scan */
SELECT word_cs
FROM words
WHERE lower(word_cs) LIKE 'ab%'

/* Uses index if available */
SELECT word_cs
FROM words
WHERE word_cs LIKE 'ab%'
OR word_cs LIKE 'AB%'
OR word_cs LIKE 'Ab%'
OR word_cs LIKE 'aB%';
user1191247
  • 10,808
  • 2
  • 22
  • 32
  • Thank you for this detailed answer. I will examine and try these things. The reason for the case handling is because the table can have entries like 'aboriginal' and 'Aboriginal' as separate rows, so collation was set to be case sensitive. Also the table has about 110,000 rows, so not huge enough to pose significant performance problems for the types of queries we expect. – TenG Jan 23 '22 at 09:12
  • Even with a case-insensitive collation you can still have multiple copies of the words with different capitalisation. And when desired you can still force a search to be case-sensitive. If the majority of your queries are going to be case-insensitive then you should switch to a _ci collation. – user1191247 Jan 23 '22 at 10:04
1

The Collation is important. lower(word_cs) is not "sargable", hence won't use the index.

This should speed up the previous answers:

WHERE word LIKE 'ab%'
  AND REGEXP_INSTR(word, '(?=.*o)(?=.*o)')

together with an index starting with word and have a _ci collation on word.

The hope is that it will use the INDEX to quickly fetch just the 'ab' words, then spend time applying regexp to only that subset, looking for 'o' and 'd'.)

Here's another thought (with the same index):

SELECT word
    FROM tbl
    WHERE word LIKE 'ab%o%'
      AND word LIKE 'ab%d%'

For "none of [str]", simply add on

     AND NOT word RLIKE '[str]'

Again, performance depends on the LIKE being done before the RLIKE.

Rick James
  • 135,179
  • 13
  • 127
  • 222