0

How does one capture a value recursively with regex, where value is a part of a group that repeats?

I have a serialized array in mysql database These are 3 examples of a serialized array

  1. a:2:{i:0;s:2:"OR";i:1;s:2:"WA";}

  2. a:1:{i:0;s:2:"CA";}

  3. a:4:{i:0;s:2:"CA";i:1;s:2:"ID";i:2;s:2:"OR";i:3;s:2:"WA";}

a:1 stands for array:{number of elements} then in between {} i:0 means element 0, i:1 means element 1 etc. then the actual value s:2:"CA" means string with length of 2

so I have 2 elements in first array, 1 element in the second and 4 elements in the last

I have this data in mysql database and I DO NOT HAVE an option to parse this with back-end code - this has to be done in mysql (10.0.23-MariaDB-log)

the repeating pattern is inside of the curly braces the number of repeats is variable (as in 3 examples each has a different number of repeating patterns), the number of repeating patterns is defined by the number at 3rd position (if that helps)

for the first example it's a:2: and so there are 2 repeating blocks:

i:0;s:2:"OR";

i:1;s:2:"WA";

I only care to extract the values in bold

So I came up with this regex ^a:(?:\d+):\{(?:i:(?:\d+);s:(?:\d+):\"(\w\w)\";)+}$

it captures the values I want all right but problem is it only captures the last one in each repeating group

so going back to the example what would be captured is

  1. WA
  2. CA
  3. WA

What I would want is

  1. OR|WA
  2. CA
  3. CA|ID|OR|WA

these are the language specific regex functions available to me:

https://mariadb.com/kb/en/library/regular-expressions-functions/

I don't care which one is used to solve the problem

Ultimately I need this in as sensible form that can be presented to the client e.g. CA,ID,OR or CA|ID|OR

Current thoughts are perhaps this isn't possible in a one liner, and I have to write a multi-step function where

  1. extract the repeating portion between the curly braces
  2. then somehow iterate over each repeating portion
  3. then use the regex on each
  4. then return the results as one string with separated elements
ambidexterous
  • 832
  • 12
  • 21
  • 1
    I don't think this is possible: https://stackoverflow.com/a/37004214/2928853 – jrook Oct 23 '18 at 02:01
  • Thank you I did check that article before. No direct way to do that by regex alone. That's why I was starting to think about other programming constructs available in mariadb - I know there is some combination that will get the job done. – ambidexterous Oct 23 '18 at 12:19
  • You have the power of a relational database, use it. Don't hide the useful info in JSON. – Rick James Oct 25 '18 at 18:48

1 Answers1

0

I doubt if such a capture is possible. However, this would probably do the job for your specific purpose.

REGEXP_REPLACE(
  REGEXP_REPLACE(
    REGEXP_REPLACE(str1, '^a:\\d+:\{', ''),
    'i:\\d+;s:\\d+:\"(\\w\\w)\";',
    '\\1,'
  ),
  '\,?\}$',
  ''
)

Basically, this works with the input string (or column) str1 like

  1. remove the first part
  2. replace every cell with the string you want
  3. remove the last 2 characters, ,}

and voila! You get a string CA,ID,OR.

Aftenote
It may or may not work well when the original array before serialised is empty (it depends how it is serialised).

Masa Sakano
  • 1,921
  • 20
  • 32
  • Thank you for your effort. Am I applying your solution wrong? I can't seem to get it to work. Here's an online experiment: https://dbfiddle.uk/?rdbms=mariadb_10.2&fiddle=1259c19b43942c272a87bc819b19aa5f – ambidexterous Oct 23 '18 at 12:15
  • there must be some catch how this works in mariadb, the simplest part of the above doesn't work either. https://dbfiddle.uk/?rdbms=mariadb_10.2&fiddle=765dc55e542c5e9e15ffce2f40067c79 Perhaps there's some difference in character escaping rules in mariadb? maybe the "\{" doesn't work – ambidexterous Oct 23 '18 at 12:26
  • Double escape like `\\d` was required for `\d` and `\w`. I have edited the answer accordingly. Hope it works on your MariaDB. – Masa Sakano Oct 23 '18 at 13:07
  • 1
    @ambidexterous Glad to hear it worked. Just a note: If a record is a serialised **empty** array, the function above may return something unexpected. If you change the last regexp into `\,?\}$` such a case too may be handled – or not, depending how the string would look like. Bear it in mind. – Masa Sakano Oct 24 '18 at 16:28
  • Yeah good point! In my case the database is NULL if there's nothing in there so I'm fine as is. But for completeness that should be added to the solution. I don't know what the array looks like serialized with no elements - my best guess `a:0:{}` I appreciate the concern! – ambidexterous Oct 26 '18 at 00:45