0

I'm trying to extract the volume of a string field to sort by it numerically.


Given the following data:

  1. Something at 300 ml
  2. 300Ml somthing
  3. Something special (with 300 Ml)
  4. 8-v something that should not match

First attempt: Just cast the string to unsigned (found in this thread: reference)

Problem: It'll obviously also take the 4th data example into account.


Second attempt: Use REGEXP_REPLACE to extract the value (found in this thread: reference)

CAST(
    REGEXP_REPLACE(
        LOWER(column),
        "[:digit:]+[:space:]*ml",
        "///"
    ) as UNSIGNED
) AS volume

...using the character_classes defined in the manual.

Problem: The character_classes seem not to work as intended (possibly an issue with the character class name for spaces?).


Third attempt: Using the common Regexp tokens

CAST(
    REGEXP_REPLACE(
        LOWER(column),
        "\d+\s*ml",
        "///"
    ) as UNSIGNED
) AS volume

...which seems to work better.

Problem: It also matches the 4th data example again.


Question: How do I perform some kind of positive lookahead in MySQL REGEXP_REPLACE that will only match data examples 1-3 (ignoring the case sensitiveness due to casting to lower)?

The one i've tried seem to work fine in PCRE (with global flag) but not in MySQL:

^.*\d+\s*ml.*$ (ignoring the missing capturing group)

Which is translated like: Match everything until 1 or more digits, followed by 0 or more whitespaces and the string ml is found.

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
Y.Hermes
  • 449
  • 5
  • 22

2 Answers2

1

Try this:

       CAST(
          REGEXP_SUBSTR( 
                LOWER(@c), 
                "([[:digit:]]+)[[:space:]]*ml" )
          AS UNSIGNED)  AS VOLUME;

You definitely want REGEX_SUBSTR() rather than REGEX_REPLACE() for your purpose.

I've tried it on MySQL 8 and MariaDB 10.3 - 10.5. Doubling the brackets [[:digit:]] was required, for me anyhow. And if you use the older \s notation, you need to double up the \ escape character: \\s+

(MariaDb's implementation of REGEX_SUBSTR() does not accept the optional parameters that MySQL's does.

I have to say, having worked in health care IT, that combining regular expressions with drug dosages frightens me. Be careful! Test! Test! If you have patients, they will thank you.

O. Jones
  • 103,626
  • 17
  • 118
  • 172
  • As this is basically what I did aferwards, i'll accept your answer ;-) FYI: It's not for health care, it's for chemistry (basic research) :-p – Y.Hermes Sep 22 '20 at 09:03
0

The problem was the combination with casting the replaced result to unsigned.

After using:

REGEXP_SUBSTR(
    LOWER(article.name),
    "[:digit:]+[:space:]*ml"
) AS volume

instead of REGEXP_REPLACE and removing the unsigned casting (regardless of double or single brackets somehow) the REGEXP works fine.

Y.Hermes
  • 449
  • 5
  • 22