Preface:
I've done quite a bit of (re)searching on this, and found the following SO post/answer: https://stackoverflow.com/a/5361490/6095216 which was pretty close to what I'm looking for. The same code, but with somewhat more helpful comments, appears here: http://thenoyes.com/littlenoise/?p=136 .
Problem Description:
I need to split 1 column of MySQL TEXT data into multiple columns, where the original data has this format (N <= 7):
{"field1":"value1","field2":"value2",...,"fieldN":"valueN"}
As you might guess, I only need to extract the values, putting each one into a separate (predefined) column. The problem is that the number and order of the fields is not guaranteed to be the same for all records. Thus, solutions using SUBSTR/LOCATE, etc. don't work, and I need to use regular expressions. Another restriction is that 3rd party libraries such as LIB_MYSQLUDF_PREG (suggested in the answer from my 1st link above) cannot be used.
Solution/Progress so far:
I've modified the code from the above links such that it returns the first/shortest match, left-to-right; otherwise, NULL is returned. I also refactored it a bit and made the identifiers more reader/maintainer-friendly :) Here's my version:
CREATE FUNCTION REGEXP_EXTRACT_SHORTEST(string TEXT, exp TEXT)
RETURNS TEXT DETERMINISTIC
BEGIN
DECLARE adjustStart, adjustEnd BOOLEAN DEFAULT TRUE;
DECLARE startInd INT DEFAULT 1;
DECLARE endInd, strLen INT;
DECLARE candidate TEXT;
IF string NOT REGEXP exp THEN
RETURN NULL;
END IF;
IF LEFT(exp, 1) = '^' THEN
SET adjustStart = FALSE;
ELSE
SET exp = CONCAT('^', exp);
END IF;
IF RIGHT(exp, 1) = '$' THEN
SET adjustEnd = FALSE;
ELSE
SET exp = CONCAT(exp, '$');
END IF;
SET strLen = LENGTH(string);
StartIndLoop: WHILE (startInd <= strLen) DO
IF adjustEnd THEN
SET endInd = startInd;
ELSE
SET endInd = strLen;
END IF;
EndIndLoop: WHILE (endInd <= strLen) DO
SET candidate = SUBSTRING(string FROM startInd FOR (endInd - startInd + 1));
IF candidate REGEXP exp THEN
RETURN candidate;
END IF;
IF adjustEnd THEN
SET endInd = endInd + 1;
ELSE
LEAVE EndIndLoop;
END IF;
END WHILE EndIndLoop;
IF adjustStart THEN
SET startInd = startInd + 1;
ELSE
LEAVE StartIndLoop;
END IF;
END WHILE StartIndLoop;
RETURN NULL;
END;
I then added a helper function to avoid having to repeat the regex pattern, which, as you can see from above, is the same for all the fields. Here is that function (I left my attempt to use a lookbehind - unsupported in MySQL - as a comment):
CREATE FUNCTION GET_MY_FLD_VAL(inputStr TEXT, fldName TEXT)
RETURNS TEXT DETERMINISTIC
BEGIN
DECLARE valPattern TEXT DEFAULT '"[^"]+"'; /* MySQL doesn't support lookaround :( '(?<=^.{1})"[^"]+"'*/
DECLARE fldNamePat TEXT DEFAULT CONCAT('"', fldName, '":');
DECLARE discardLen INT UNSIGNED DEFAULT LENGTH(fldNamePat) + 2;
DECLARE matchResult TEXT DEFAULT REGEXP_EXTRACT_SHORTEST(inputStr, CONCAT(fldNamePat, valPattern));
RETURN SUBSTRING(matchResult FROM discardLen FOR LENGTH(matchResult) - discardLen);
END;
Currently, all I'm trying to do is a simple SELECT query using the above code. It works correctly, BUT IT. IS. SLOOOOOOOW... There are only 7 fields/columns to split into, max (not all records have all 7)! Limited to 20 records, it takes about 3 minutes - and I have about 40,000 records total (not very much for a database, right?!) :)
And so, finally, we get to the actual question: [how] can the above algorithm/code (pretty much a brute search at this point) be improved SIGNIFICANTLY performance-wise, such that it can be run on the actual database in a reasonable amount of time? I started looking into the major known pattern-matching algorithms, but quickly got lost trying to figure out what would be appropriate here, in large part due to the number of available options and their respective restrictions, conditions for use, etc. Plus, it seems like implementing one of these in SQL just to see if it would help, might be a lot of work.
Note: this is my first post ever(!), so please let me know (nicely) if something is not clear, etc. and I will do my best to fix it. Thanks in advance.