1

I've read quite a few similar posts but none solves my case, which could well be because of my lack of sufficient knowledge, so please bear with me.

One of the search options in my terminological dictionary is "whole words only". At first I was using

WHERE ".$source." RLIKE '[[:<:]]".$keyword."[[:>:]]'

However, this failed to match whole words for the first or second $keyword when there is more than one. Then I found

WHERE ".$source." REGEXP '[[:<:]]".$keyword."[[:>:]]'

and

WHERE ".$source." REGEXP '(^| )".$keyword."( |$)'

while searching these forums

I just tested both of the above in my PhpMyAdmin and found out that the former executes in 0.0740 seconds, while the latter takes twice as long, 0.1440 seconds, so I guess I should stick with the former.

What bothers me the most is the huge discrepancy in results, e.g. searching for a single word ("tool"):

  1. Using the [[:<:]] and [[:>:]] word boundary in PhpMyAdmin returns 34 results.

  2. Using (^| ) and ( |$) in PhpMyAdmin returns 26 results.

  3. Running the #1 regexp in my PHP script returns 34 results (this is the correct number).

Here's the whole MySQL block:

foreach($keywords as $keyword) {
    $query = $db->query("SELECT * FROM ".DICTIONARY_TABLE." " .
    "JOIN ".DICTIONARY_THEMES." ON ".DICTIONARY_TABLE.".theme_id = ".DICTIONARY_THEMES.".theme_id ".
    "LEFT JOIN ".DICTIONARY_DEFINITIONS." ON ".DICTIONARY_TABLE.".term_id = ".DICTIONARY_DEFINITIONS.".term_id ".
    "WHERE ".DICTIONARY_TABLE.".".$source." REGEXP '(^| )".$keyword."( |$)'".
    //"WHERE ".DICTIONARY_TABLE.".".$source." REGEXP '[[:<:]]".$keyword."[[:>:]]'".
    " ORDER BY ".DICTIONARY_TABLE.".theme_id, ".DICTIONARY_TABLE.".".$source."");
}

I've commented out the search option I'm not using.

Now, if I try TWO keywords, e.g. "cutting tool", I still get 34 results in the page. I'm unsure if I'm doing this right in PhpMyAdmin:

SELECT * FROM `asphodel_dictionary_terms` WHERE english REGEXP '[[:<:]]cutting[[:>:]]';
SELECT * FROM `asphodel_dictionary_terms` WHERE english REGEXP '[[:<:]]tool[[:>:]]'

This returns 44 results for "cutting" and 34 results for "tool". The query using (^| )... returns 37 + 26 results respectively.

Any feedback that would help me sort things out would be appreciated!

The database contains a total of 109,000 entries in the main table, there are 82 themes in the DICTIONARY_THEMES table and 727 entries in the DICTIONARY_DEFINITIONS table. Not a huge database and it won't grow much bigger.

GMB
  • 216,147
  • 25
  • 84
  • 135
cheeseus
  • 369
  • 3
  • 21
  • `keyword( |$)` doesn't match `keyword.` or `keyword,`. – Paul Spiegel Feb 23 '19 at 10:01
  • Will `keyword(\.|,| |$)` do? I'm not good at regex, sorry if this is a silly question. – cheeseus Feb 23 '19 at 10:58
  • I'm not a regex expert. Add the [regex] tag to your question and hope that @WiktorStribiżew will find it :-) – Paul Spiegel Feb 23 '19 at 11:07
  • There is no problem in this post. You are comparing apples to oranges since the two regexps are meant to match words in different contexts. One is based on word boundaries and the other uses whitespace boundaries. It is up to you to decide what you need and only after that you can try to optimize. In MySQL, you won't be able to optimize much though, its regex engine is too "basic". – Wiktor Stribiżew Feb 23 '19 at 13:58

1 Answers1

1

You are getting different results because the two regexes are not identical.

(^| ) means : either the beginning of the string or a space (( |$) has the same meaning at end of string).

[[:<:]] and [[:>:]] are word boundaries : conceptually this refers to characters that separate words, and usually regex engines interpret it as something like : anything but a digit, a letter or an underscore.

So basically the first pattern is more restrictive than the second (space, beginning and end of string are word boundaries, but there are others).

If you have more than one keyword to search for, you would need to repeat the regex matches, like :

WHERE 
    ".$source." RLIKE '[[:<:]]".$keyword1."[[:>:]]'
    OR ".$source." RLIKE '[[:<:]]".$keyword2."[[:>:]]'

Or create a new regex by combining the keywords :

WHERE 
    ".$source." RLIKE '[[:<:]](".$keyword1.")|(".$keyword2.")[[:>:]]'

NB : for search requirement, you should consider using MySQL Full Text Search, which are primarily built for the purpose of searching for full words (there are pre-requisites, though).

GMB
  • 216,147
  • 25
  • 84
  • 135
  • 1
    Thanks for explaining the difference between the two regexes. I will stick to the word boundary one then – also because it is faster. However, I don't think I can do what you advise in the second part of your post because I take the `$keywords` array and loop it in a `foreach`, that is, I only have `$keyword`. Plus, this method will never return matches for all keywords if present in the same entry, e.g. when I search for "stop loss" this will return entries containing "stop" or "loss" but not "stop loss [order]". – cheeseus Feb 23 '19 at 14:05
  • Just noticed something odd: the word boundary treats hyphenated words as separate words, e.g. `WHERE english REGEXP '[[:<:]]stop[[:>:]]' AND english REGEXP '[[:<:]]loss[[:>:]]'` will return "stop-loss treaty" and "stop-loss order". – cheeseus Feb 23 '19 at 14:12
  • @cheeseus : yes that’s what I would expect, regexp engines usually work this way – GMB Feb 23 '19 at 14:30