0

In the below query, I'd like to find records that start with engineer . e.g. I'd like to pull back records with the description engineering

SELECT * FROM app.desc_test t
WHERE lower(t.desc) REGEXP '[[:<:]]engineer[[:>:]]';

The word boundaries are properly handling all special characters (i.e. commas, spaces, special characters, etc that are before and after), but I'm not sure how to write the Regex so that it starts with engineer.

Also, how would I make this say starts with OR ends with engineer.

CREATE TABLE desc_test (
  id int(11) NOT NULL AUTO_INCREMENT,
  desc varchar(1000) COLLATE utf8mb4_unicode_ci NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

Edit
The value will be unknown/dynamic, so hardcoding any "ing" expression isn't the solution.

Bill Karwin
  • 538,548
  • 86
  • 673
  • 828
contactmatt
  • 18,116
  • 40
  • 128
  • 186
  • `REGEXP '[[:<:]]engineer(ing)?[[:>:]]'`? Or `REGEXP '[[:<:]]engineer[[:alpha:]]*[[:>:]]'`? Sorry, when you say "must start with", do you mean a word in a string should start with it, or the whole string should start with `engineer`? – Wiktor Stribiżew Jan 21 '19 at 17:26
  • 1
    You could omit the word-boundary pattern at the end: `'[[:<:]]engineer'` But keep in mind that the word-boundary patterns will not work on MySQL 8.0 because they have changed their regexp implementation. Just giving you a heads up in case you intend to upgrade, ever. – Bill Karwin Jan 21 '19 at 17:26
  • Thanks @WiktorStribiżew, should clarify that the string will be coming in dynamically. Will make an edit. – contactmatt Jan 21 '19 at 17:27
  • 1
    What you ultimately need is a `FULLTEXT` index that supports stemming, so it matches common English suffixes. But MySQL's FULLTEXT doesn't support stemming yet. There's a worklog associated with this feature, but it hasn't been begun yet: https://dev.mysql.com/worklog/task/?id=2423 – Bill Karwin Jan 21 '19 at 17:29
  • Thanks @BillKarwin, this is good insight. – contactmatt Jan 21 '19 at 17:36
  • You'll be happier with some kind of fulltext index, because using `REGEXP` is thousands of times slower than an indexed solution. See my presentation [Full Text Search Throwdown](https://www.slideshare.net/billkarwin/practical-full-text-search-with-my-sql). You might need to move to some other database, or use a specialized search technology like ElasticSearch, which [supports stemming](https://www.elastic.co/guide/en/elasticsearch/guide/current/stemming.html). – Bill Karwin Jan 21 '19 at 17:40

3 Answers3

2

If you only want to match the beginning of the word, you can just remove [[:>:]] from the regexp.

SELECT * FROM app.desc_test t
WHERE lower(t.desc) REGEXP '[[:<:]]engineer';
Barmar
  • 741,623
  • 53
  • 500
  • 612
1

Note: Full Text Search as referenced by Bill Karwin is preferred

because using REGEXP is thousands of times slower than an indexed solution

But...

To use your current REGEXP implementation, your MySQL should look like this:

SELECT * FROM app.desc_test t WHERE lower(t.desc) 
REGEXP '[[:<:]]engineer[a-z]*[[:>:]]';

The Regex looks like this:

[[:<:]]engineer[a-z]*[[:>:]]

Meaning:

[[:<:]] - Start of word boundary
engineer - The string given by the search (dynamic)
[a-z] - any character between a-z between zero and any number of times.
* - The above "group" between zero and any number of times.
[[:>:]] - End of word boundary

The above should do what you need. You can also customise it for instance to include digits ((a-z0-9)), or whatever you wish.


Revisions to this answer:

One:

Revised, Improved: use [[:alpha:]] so:

[[:<:]]engineer[[:alpha:]]*[[:>:]]

Two:

As correctly pointed out by Barmar there is actually little need for excess REGEXP. Your word boundaries, or lack thereof, does your work for you.

Therefore to select any word beginning with engineer or ending with engineer, you simply make a REGEXP OR statement :

SELECT * FROM app.desc_test t WHERE lower(t.desc) 
REGEXP '([[:<:]]engineer)|(engineer)[[:>:]])' 

This means:

Return true if:

  • The term engineer comes at the start of a word, regardless of what comes after it.
  • OR the term engineer comes at the end of a word, regardless of what comes before it.

This should fit exactly what you're looking for. This has been tested on MySQL 5.7.


Sources :

Example cases:

Engineer

Match

Engineering

Match

Engineers

Match

Engineer!

Match


Also, how would I make this say starts with OR ends with engineer.

Simply flip around the REGEXP and set it as an OR statement:

SELECT * FROM app.desc_test t WHERE lower(t.desc) 
REGEXP '[[:<:]](engineer[[:alpha:]]*)|([[:alpha:]]*engineer)[[:>:]]';

Which tells the REGEXP to:
"look for engineer at the beginning of the word followed by any a-z values or look for any a-z values followed by engineer at the end of the word".

Martin
  • 22,212
  • 11
  • 70
  • 132
  • Same effect, but simpler: `'[[:<:]](engineer[[:alpha:]]*[[:>:]]'`. (Oh, `engineer123` might not work the same.) – Rick James Jan 21 '19 at 21:29
1

For "desc starts with":

"Starts with:

REGEXP:  '^engineer...'
LIKE:    'engineer%...'

Case folding:

If the collation of the column is `..._ci`, then do _not_ waste time with `LOWER()`.

So, this is optimal for finding desc that starts with "engineer" or "engineering" or "Engineer", etc:

   WHERE t.desc LIKE 'engineer%'

If you really meant "where desc contains 'engineer' or ...", then

   WHERE t.desc REGEXP '[[:<:]]engineer'

But a better way would be to use FULLTEXT(desc) and use this; it allows the word to be anywhere in desc and desc can be TEXT.

   WHERE MATCH(desc) AGAINST('+engineer*' IN BOOLEAN MODE)

You must pick among the choices based on the actual requirements. Meanwhile, here is the relative performance of them:

  • LOWER(desc) ... -- poor, regardless of the rest of the clause
  • LIKE 'engineer%' -- excellent if you have INDEX(desc)
  • LIKE 'engineer%' -- poor with no index, or with prefixing: INDEX(desc(100))
  • MATCH... -- excellent due to FULLTEXT index.
  • REGEXP ... -- poor; will check every record

For "there is a word that starts or ends with":

You need to list positive and negative test cases:

engineering blah
The engineer.
MechanicalEngineering  -- neither starts nor ends at word boundary??
engineer

If all of those are valid, then this is the only viable answer:

    WHERE t.desc LIKE '%engineer%'

The equivalent REGEXP 'engineer' is slower (but has the same effect).

For other situations, I would look at something close to

   WHERE t.desc REGEXP '[[:<:]]engineer|engineer[[:>:]]'

which looks for a "word" that starts or ends with 'engineer'. Note that this does not include 'MechanicalEngineering'.

Rick James
  • 135,179
  • 13
  • 127
  • 222