
Preface:

I've done quite a bit of (re)searching on this, and found the following SO post/answer: https://stackoverflow.com/a/5361490/6095216 which was pretty close to what I'm looking for. The same code, but with somewhat more helpful comments, appears here: http://thenoyes.com/littlenoise/?p=136 .

Problem Description:

I need to split a single column of MySQL TEXT data into multiple columns, where the original data has this format (N <= 7):

{"field1":"value1","field2":"value2",...,"fieldN":"valueN"}

As you might guess, I only need to extract the values, putting each one into a separate (predefined) column. The problem is that the number and order of the fields is not guaranteed to be the same for all records. Thus, solutions using SUBSTR/LOCATE, etc. don't work, and I need to use regular expressions. Another restriction is that 3rd party libraries such as LIB_MYSQLUDF_PREG (suggested in the answer from my 1st link above) cannot be used.

Solution/Progress so far:

I've modified the code from the above links such that it returns the first/shortest match, left-to-right; otherwise, NULL is returned. I also refactored it a bit and made the identifiers more reader/maintainer-friendly :) Here's my version:

CREATE FUNCTION REGEXP_EXTRACT_SHORTEST(string TEXT, exp TEXT)
RETURNS TEXT DETERMINISTIC
BEGIN
    DECLARE adjustStart, adjustEnd BOOLEAN DEFAULT TRUE;
    DECLARE startInd INT DEFAULT 1;
    DECLARE endInd, strLen INT;
    DECLARE candidate TEXT;

    IF string NOT REGEXP exp THEN
        RETURN NULL;
    END IF;

    IF LEFT(exp, 1) = '^' THEN
        SET adjustStart = FALSE;
    ELSE
        SET exp = CONCAT('^', exp);
    END IF;
    IF RIGHT(exp, 1) = '$' THEN
        SET adjustEnd = FALSE;
    ELSE
        SET exp = CONCAT(exp, '$');
    END IF;

    SET strLen = CHAR_LENGTH(string); /* CHAR_LENGTH counts characters; LENGTH counts bytes, which would misalign SUBSTRING positions on multi-byte text */
    StartIndLoop: WHILE (startInd <= strLen) DO
        IF adjustEnd THEN
            SET endInd = startInd;
        ELSE
            SET endInd = strLen;
        END IF;
        EndIndLoop: WHILE (endInd <= strLen) DO
            SET candidate = SUBSTRING(string FROM startInd FOR (endInd - startInd + 1));
            IF candidate REGEXP exp THEN
                RETURN candidate;
            END IF;
            IF adjustEnd THEN
                SET endInd = endInd + 1;
            ELSE
                LEAVE EndIndLoop;
            END IF;
        END WHILE EndIndLoop;
        IF adjustStart THEN
            SET startInd = startInd + 1;
        ELSE
            LEAVE StartIndLoop;
        END IF;
    END WHILE StartIndLoop;
    RETURN NULL;
END;
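A quick sanity check of the function against the sample format (the literal value here is made up for illustration):

```sql
-- The pattern gets anchored internally, so the shortest window that
-- fully matches is returned: the field name plus its quoted value.
SELECT REGEXP_EXTRACT_SHORTEST(
    '{"field2":"value2","field1":"value1"}',
    '"field1":"[^"]+"'
) AS extracted;
-- extracted = '"field1":"value1"'
```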

I then added a helper function to avoid having to repeat the regex pattern, which, as you can see from above, is the same for all the fields. Here is that function (I left my attempt to use a lookbehind - unsupported in MySQL - as a comment):

CREATE FUNCTION GET_MY_FLD_VAL(inputStr TEXT, fldName TEXT)
RETURNS TEXT DETERMINISTIC
BEGIN
    DECLARE valPattern TEXT DEFAULT '"[^"]+"'; /* MySQL doesn't support lookaround :( '(?<=^.{1})"[^"]+"'*/
    DECLARE fldNamePat TEXT DEFAULT CONCAT('"', fldName, '":');
    DECLARE discardLen INT UNSIGNED DEFAULT CHAR_LENGTH(fldNamePat) + 2; /* skip past '"fldName":"' (character count, not bytes) */
    DECLARE matchResult TEXT DEFAULT REGEXP_EXTRACT_SHORTEST(inputStr, CONCAT(fldNamePat, valPattern));
    RETURN SUBSTRING(matchResult FROM discardLen FOR CHAR_LENGTH(matchResult) - discardLen);
END;
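For context, the query exercising the helper looks roughly like this (the table and column names are placeholders, not my real schema):

```sql
-- One call per target column; records missing a field yield NULL.
SELECT GET_MY_FLD_VAL(json_col, 'field1') AS field1,
       GET_MY_FLD_VAL(json_col, 'field2') AS field2,
       GET_MY_FLD_VAL(json_col, 'field7') AS field7
FROM my_table
LIMIT 20;
```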

Currently, all I'm trying to do is a simple SELECT query using the above code. It works correctly, BUT IT. IS. SLOOOOOOOW... There are only 7 fields/columns to split into, max (not all records have all 7)! Limited to 20 records, it takes about 3 minutes - and I have about 40,000 records total (not very much for a database, right?!) :)

And so, finally, we get to the actual question: how can the above algorithm/code (pretty much a brute-force search at this point) be improved SIGNIFICANTLY performance-wise, such that it can be run on the actual database in a reasonable amount of time? I started looking into the major known pattern-matching algorithms, but quickly got lost trying to figure out what would be appropriate here, in large part due to the number of available options and their respective restrictions, conditions for use, etc. Plus, implementing one of these in SQL just to see if it helps might be a lot of work.

Note: this is my first post ever(!), so please let me know (nicely) if something is not clear, etc. and I will do my best to fix it. Thanks in advance.

Yury
    You're fighting a battle you can't win here. Regular expressions are always going to be slow, they can't be indexed. If that's a JSON column and you're not too married to MySQL, Postgres has *fantastic* support for JSON and deep indexes into the content. – tadman Jun 20 '16 at 19:38
  • Or upgrade to 5.7.8 or later for native JSON support in MySQL https://dev.mysql.com/doc/refman/5.7/en/json.html – Matt Raines Jun 20 '16 at 20:10
  • @tadman Yes, but even with regex, I'm pretty sure my solution can be improved. For example, I used a suggestion from here https://www.loggly.com/blog/five-invaluable-techniques-to-improve-regex-performance/ and changed the * quantifier to +, which made my query run almost 3 times faster (forgot to mention that in my question). My next thought was to try to use lookbehind, so that it doesn't have to start looking from the beginning of the string every time (yes, I know I said the field order is not guaranteed...) - but MySQL doesn't support it. – Yury Jun 20 '16 at 20:14
  • @tadman I'd give you +1 for suggesting Postgres (as a temporary placeholder to split the data, before putting it back into MySQL), but I don't think my reputation is high enough yet. Although that seems like a pretty radical approach :) – Yury Jun 20 '16 at 20:15
  • @MattRaines Good suggestion, thanks! Not sure if I can do that at this point, but definitely something to keep in mind. We're on 5.5 right now. – Yury Jun 20 '16 at 20:18
  • I've had some very good experiences with Postgres + JSON, especially the new optimized `JSONB` column that offers improved performance and more compact storage. The indexing and query capabilities here compare favorably to document-stores like MongoDB. – tadman Jun 20 '16 at 20:34
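For reference, the native-JSON route suggested in the comments (available from MySQL 5.7.8) reduces each field to a single expression; the table and column names below are placeholders:

```sql
-- JSON_EXTRACT pulls the value at a path; JSON_UNQUOTE strips the quotes.
-- A missing field simply yields NULL.
SELECT JSON_UNQUOTE(JSON_EXTRACT(json_col, '$.field1')) AS field1,
       JSON_UNQUOTE(JSON_EXTRACT(json_col, '$.field2')) AS field2
FROM my_table;
```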

2 Answers


I was able to solve this by parsing the JSON, as suggested by tadman and Matt Raines above. Being new to the concept of JSON, I just didn't realize it could be done this way at all...a little embarrassing, but lesson learned!

Anyway, I used the get_option function in the common_schema framework: https://code.google.com/archive/p/common-schema/ (found through this post, which also demonstrates how to use the function: Parse JSON in MySQL ). As a result, my INSERT query took about 15 minutes to run, vs the 30+ hours it would've taken with the REGEXP solution. Thanks, and until next time! :)
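For anyone following the same route, the INSERT ended up in roughly this shape. Treat the exact call as an assumption based on the linked post: as I understand it, common_schema's get_option takes the JSON-ish string and a key, and returns the corresponding value. All names here are placeholders:

```sql
-- One get_option call per target column, populating the new table in bulk.
INSERT INTO split_table (id, field1, field2)
SELECT id,
       common_schema.get_option(json_col, 'field1'),
       common_schema.get_option(json_col, 'field2')
FROM source_table;
```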


Don't do it in SQL; do it in PHP or some other language that has builtin tools for parsing JSON.
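For instance, the parsing core of such a client-side script could be sketched in Python as below. The field names and the 7-field assumption come from the question; wiring it to the database (read TEXT column, write the split columns back) is left to whatever connector you use:

```python
import json

def split_record(raw):
    """Parse one JSON blob from the TEXT column and return the values
    for the 7 known fields (None for any field the record lacks)."""
    data = json.loads(raw)
    fields = ["field%d" % i for i in range(1, 8)]
    return {name: data.get(name) for name in fields}

row = split_record('{"field2":"value2","field1":"value1"}')
# row["field1"] == "value1"; row["field3"] is None
```

Field order and count no longer matter: `dict.get` handles missing keys, which is exactly the irregularity that made the regex approach painful.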

Rick James