
I need to parse a string that's a comma-separated list of parameters of the form

key1=value1,key2=value2,key3=value3...

The complication is that the values can be enclosed in quotation marks so that they can contain spaces, commas, and the like. Commas enclosed in quotation marks should of course not count as parameter separators. (There can also be spaces in various places outside the quotation marks, which should probably be ignored.)

My thought is to split the list at the commas and then, inside each parameter definition, to separate the key from the value at the equal sign. So to split the parameters, I need to find the valid commas (those not inside quotation marks); I'm thinking a regular expression is the way to go for brevity and directness.

Here are some example strings:

  • Include="All Violations", CheckType=MaxTrans
  • MetricName = PlacedInstances, PlacedOnly = 1
  • CheckType=Hold, Include="reg2reg,in2reg,in2out,reg2out"
  • CheckType=Setup, Include="reg2reg,in2reg,in2out,reg2out (sic)

Yes, the last one is poorly formed: missing a terminating quotation mark in the value.

I found this answer helpful (regex: /,(?=(?:(?:[^"]*"){2})*[^"]*$)/), except that it cannot parse the poorly formed example. In my case the equal signs provide additional information that should make parsing that one possible.
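To make the failure mode concrete, here is a small sketch of how that lookahead split behaves on a well-formed and on the poorly formed example. The names splitOutsideQuotes, wellFormed, and malformed are made up for illustration only.

```javascript
// Split at commas that are followed by an even number of quotation marks,
// i.e. commas that are assumed to sit outside any quoted value.
const splitOutsideQuotes = (text) =>
  text.split(/,(?=(?:(?:[^"]*"){2})*[^"]*$)/);

const wellFormed = 'CheckType=Hold, Include="reg2reg,in2reg,in2out,reg2out"';
const malformed = 'CheckType=Setup, Include="reg2reg,in2reg,in2out,reg2out';

// Two items: the commas inside the quotation marks are left alone.
console.log(splitOutsideQuotes(wellFormed));

// Four items: with the closing quotation mark missing, the quote count is
// odd, so the real separator is skipped and the inner commas are split.
console.log(splitOutsideQuotes(malformed));
```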

I tried /(?<==[^"]+),/, which works for the poorly formed example but fails on my first one. I think what I need is a way to find commas that are preceded by an equal sign and have either zero or two quotation marks (not just one) between them and the nearest preceding equal sign. But how do I write that as a JavaScript regex?
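For reference, a minimal sketch of what that lookbehind attempt does on the two examples mentioned. It assumes an engine with lookbehind support (for example a recent Node.js or Chrome); splitAfterUnquotedValue is a hypothetical helper name.

```javascript
// Split at commas that are preceded by "=" plus one or more non-quote
// characters, i.e. commas that directly follow an unquoted value.
const splitAfterUnquotedValue = (text) => text.split(/(?<==[^"]+),/);

const firstExample = 'Include="All Violations", CheckType=MaxTrans';
const poorlyFormed = 'CheckType=Setup, Include="reg2reg,in2reg,in2out,reg2out';

// One item: the comma follows a closing quotation mark, so the lookbehind
// fails and the string is not split at all.
console.log(splitAfterUnquotedValue(firstExample));

// Two items: only the comma after "=Setup" qualifies as a separator.
console.log(splitAfterUnquotedValue(poorlyFormed));
```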

JohnK

3 Answers


Something like this would work:

/(?:^|, *)(?<key>[a-z]+) *= *(?<value>[^\r\n,"]+|"[^\r\n"]+"?)/gmi

https://regex101.com/r/z05WcM/1

  • (?:^|, *)(?<key>[a-z]+) - a named capture group "key": a sequence of letters (case-insensitive because of the i flag) that sits either at the start of a line or after a comma and optional spaces
  •  *= * - the assignment operator (equal sign), which can have spaces on either side
  • (?<value>[^\r\n,"]+|"[^\r\n"]+"?) - a named capture group "value": either a string containing no commas and no quotation marks, or a string that starts with a quotation mark, may contain commas, and ends with an optional closing quotation mark

But it will fail on data with escaped quotation marks inside a value, such as Include="All Viola\"tions".

Do note that I avoided lookbehinds because they are not supported in all browsers.
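A minimal usage sketch for the pattern above, assuming String.prototype.matchAll is available. The names pairPattern and parseParameters are hypothetical, and stripping the surrounding quotation marks is an extra step that is not part of the regex itself.

```javascript
const pairPattern =
  /(?:^|, *)(?<key>[a-z]+) *= *(?<value>[^\r\n,"]+|"[^\r\n"]+"?)/gmi;

// Returns an array of [key, value] pairs. An array is used instead of an
// object because a key may appear more than once in a parameter list.
function parseParameters(text) {
  return Array.from(text.matchAll(pairPattern), ({ groups }) => [
    groups.key,
    // Drop a leading and (if present) trailing quotation mark, then trim.
    groups.value.replace(/^"|"$/g, '').trim(),
  ]);
}

console.log(parseParameters('Include="All Violations", CheckType=MaxTrans'));
console.log(parseParameters('MetricName = PlacedInstances, PlacedOnly = 1'));
console.log(
  parseParameters('CheckType=Setup, Include="reg2reg,in2reg,in2out,reg2out')
);
```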

MonkeyZeus

One could use an approach based on two regular expressions:

  1. /,\s*(?=[^=,]+=)/
  2. /^(?<key>[^=\s]+)\s*="*(?<value>[^"]+)/

The first one splits the provided string according to the OP's requirements. Its positive lookahead accepts a comma only when it is followed by what looks like the next key and its equal sign.

The second one is used in a mapping step over the resulting array of parameter items. Each item is processed by a regex that captures the named groups key and value; the captured value is additionally trimmed.

// see ... [https://regex101.com/r/nUc8en/1/]
const regXParameterSplit = (/,\s*(?=[^=,]+=)/);

// see ... [https://regex101.com/r/7xSwyX/1/]
const regXCaptureKeyValue = (/^(?<key>[^=\s]+)\s*="*(?<value>[^"]+)/);

const testSample = 'Include="All Violations", CheckType=MaxTrans, MetricName = PlacedInstances, PlacedOnly = 1, CheckType=Hold, Include="reg2reg,in2reg,in2out,reg2out", CheckType=Setup, Include="reg2reg,in2reg,in2out,reg2out,CheckType=Setup';

function getKeyAndValue(template) {
  const { groups } = (regXCaptureKeyValue.exec(template) || {});
  if (groups) {
    groups.value = groups.value.trim();
  }
  return groups;
}

console.log(
  '... just splitting ...',
  testSample
    .split(regXParameterSplit)
);
console.log(
  '... the full approach ...',
  testSample
    .split(regXParameterSplit)
    .map(getKeyAndValue)
);
Peter Seliger
  • @JohnK ... are there any questions regarding the above approach? – Peter Seliger Feb 11 '21 at 08:13
  • Are lookaheads supported in all browsers? – JohnK Feb 11 '21 at 15:08
  • @JohnK ... according to SO users [*"...lookahead has been supported from the beginning...*](https://stackoverflow.com/questions/18462467/javascript-support-of-lookaheads-and-lookbehinds-in-regular-expressions). If one additionally checks either [*MDN*](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#browser_compatibility) or [*caniuse*](https://caniuse.com/?search=RegExp%20look) one only gets displayed issues with *lookbehind assertions*. – Peter Seliger Feb 11 '21 at 17:12

Use

string.match(/\w+\s*=\s*(?:"[^"\n]*(?:"|$)|\S+(?=,|$))/g)

See proof.

Explanation

--------------------------------------------------------------------------------
  \w+                      word characters (a-z, A-Z, 0-9, _) (1 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  =                        '='
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    "                        '"'
--------------------------------------------------------------------------------
    [^"\n]*                  any character except: '"', '\n'
                             (newline) (0 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      "                        '"'
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      $                        the end of the string (no m flag is
                               set, so not at line breaks)
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    \S+                      non-whitespace (all but \n, \r, \t, \f,
                             and " ") (1 or more times (matching the
                             most amount possible))
--------------------------------------------------------------------------------
    (?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
      ,                        ','
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      $                        the end of the string (no m flag is
                               set, so not at line breaks)
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
  )                        end of grouping
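As a possible follow-up step, each matched item can be separated into key and value at its first equal sign. This is only a sketch: parameterPattern and toPairs are hypothetical names, and removing the quotation marks is an added assumption about the desired output.

```javascript
const parameterPattern = /\w+\s*=\s*(?:"[^"\n]*(?:"|$)|\S+(?=,|$))/g;

// Match every key=value item, then split each item at its first "=".
function toPairs(text) {
  return (text.match(parameterPattern) || []).map((item) => {
    const at = item.indexOf('=');
    const key = item.slice(0, at).trim();
    // Trim first, then drop a leading and (if present) trailing quote.
    const value = item.slice(at + 1).trim().replace(/^"|"$/g, '');
    return [key, value];
  });
}

console.log(toPairs('MetricName = PlacedInstances, PlacedOnly = 1'));
console.log(
  toPairs('CheckType=Setup, Include="reg2reg,in2reg,in2out,reg2out')
);
```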
Ryszard Czech
  • Not bad. But it appears to capture the commas that should be the dividers when there's no closing quotation mark. This means I'd have to include a separate step to remove those commas. – JohnK Feb 11 '21 at 15:03
  • @JohnK I think I managed to address this "bug", please check. – Ryszard Czech Feb 11 '21 at 20:52