Given a string such as

1, 'str,ing', [1, 2, [3, 4, 5, 'str,ing']], 'st[rin,g]['

I want to split it based on commas, but excluding commas inside inner strings or square brackets. So I would like the output to be a list of

1

'str,ing'

[1, 2, [3, 4, 5, 'str,ing']]

'st[rin,g]['

The closest I've gotten is ,(?=(?:[^'"[\]]*['"[\]][^'"[\]]*['"[\]])*[^'"[\]]*$), but it doesn't realize that a ] can't close a ', and so on.
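To make the failure concrete, here is a minimal sketch (the names `sample` and `naive` are only illustrative). The lookahead merely checks that an even number of quote or bracket characters follows the comma. The sample contains 13 of them, so the first comma is rejected while the comma inside 'str,ing' is accepted.

```javascript
// Illustrative names only: `sample` is the string from the question,
// `naive` is the lookahead-based regex attempt.
const sample = "1, 'str,ing', [1, 2, [3, 4, 5, 'str,ing']], 'st[rin,g]['";
const naive = /,(?=(?:[^'"[\]]*['"[\]][^'"[\]]*['"[\]])*[^'"[\]]*$)/;

// The split happens at commas followed by an even count of ' " [ ] characters,
// regardless of which delimiter opens or closes which.
const naiveParts = sample.split(naive);

// The first piece is "1, 'str": the split landed inside the quoted string.
console.log(naiveParts[0]);
```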

junvar
  • look up a csv parser. – Daniel A. White Apr 18 '18 at 17:06
  • Your string format appears to be hierarchical. If so, it will be impossible to construct a regex that will work on every edge case, and you should not rely on one that appears to work on "almost" everything. You need to write a proper lexical, recursive-descent parser to deal with hierarchical string formats. – Patrick Roberts Apr 18 '18 at 17:09
  • Don't use regex, but if you *must*: `'[^']*'|(\[(?:[^][]*|(?1))*])|[^,\s]+[^,]*` with the `regex` package (not the `re` package). You're matching rather than splitting. – ctwheels Apr 18 '18 at 17:10
  • Have you tried JSON.parse('[' + mystring + ']') – Shanimal Apr 18 '18 at 17:38
  • @Shanimal, that won't work, because the insides aren't necessarily valid json. e.g, i could have the string as `1, 2, garbage` – junvar Apr 18 '18 at 17:53
  • See also: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – user234461 Apr 19 '18 at 16:01

2 Answers

JavaScript's regular expressions have no recursion or counting constructs, so they cannot track arbitrary nesting depth (nested arrays, for instance). If you are dealing with malformed data, you will have to make some assumptions about it and step through it manually.

Here is an example that assumes every [ has a matching ] and that {} are not special. (Each inner loop also checks for the end of the string to prevent runaway loops.)

var str = "1, 'str,ing', [1, 2, [3, 4, 5, 'str,ing']], 'st[rin,g]['";
var start_index = 0;
var parts = [];

for (var index = 0; index < str.length; index++) {
  // Single quote blocks
  if(str.charAt(index) == "'") {
    while(str.charAt(++index) != "'"  && index < str.length);
  } else 
  // Double quote blocks
  if(str.charAt(index) == '"') {
    while(str.charAt(++index) != '"'  && index < str.length);
  } else
  // array blocks
  if(str.charAt(index) == '[') {
    var depth = 1;
    while(depth != 0 && index < str.length) {
      index++;
      if(str.charAt(index) == '[') depth++;
      if(str.charAt(index) == ']') depth--;
    }
  } else if(str.charAt(index) == ',') {
    parts.push(str.substring(start_index, index).trim());
    start_index = index+1;
  }
}
parts.push(str.substring(start_index).trim());

console.log(parts)
Tezra
  • this looks great, i'm going to build off this to handle escaped quotation marks inside strings (e.g. `"1, 'text\'moretext', 2"`) and brackets inside strings inside lists (e.g. `"1, ['two', ']', 'three'], 4"`) – junvar Apr 19 '18 at 00:46
  • There might be some edge cases that also throw this `IndexOutOfBounds` Exception. Can you help fix it? – Ninja Aug 29 '19 at 01:22
  • @Ninja Every index++ is paired with a str.length check... I changed the one while loop to move the index increment up to the same time as the str.length check, but aside from that, I would need an example of the edge case. I would recommend surrounding this with a try-catch and throwing out anything that it can't salvage. – Tezra Aug 29 '19 at 12:29
  • Thanks for the response. Hmm, strange. I'm getting some strange exceptions on my side, for example when the string's last character is a single or double quote. – Ninja Aug 29 '19 at 15:57
  • @Ninja The code in my answer is meant more as an example of what I'm talking about than as code to be used. Have you tried the code from Junvar's answer? (that is a more complete version that addresses the OP's needs, and should be much easier to adapt to your needs) – Tezra Aug 29 '19 at 17:03

Based on @Tezra's answer above, I added some logic to handle escaped quotes inside strings and brackets inside strings.

class ParamSplitter {
  constructor(string) {
    this.string = string;
    this.index = -1;
    this.startIndex = 0;
    this.params = [];
  }

  splitByParams() {
    let depth = 0;

    while (this.nextIndex() && (!this.atQuote() || this.skipQuote())) {
      let char = this.string[this.index];
      if (char === '[')
        depth++;
      else if (char === ']')
        depth--;
      else if (char === ',' && !depth) {
        this.addParam();
        this.startIndex = this.index + 1;
      }
    }

    this.addParam();
    return this.params;
  }

  findIndex(regex, start) { // returns -1 or index of match
    let index = this.string.substring(start).search(regex);
    return index >= 0 ? index + start : -1;
  }

  nextIndex() {
    this.index = this.findIndex(/[,'"[\]]/, this.index + 1);
    return this.index !== -1;
  }

  atQuote() {
    let char = this.string[this.index];
    return char === '"' || char === "'";
  }

  skipQuote() {
    let char = this.string[this.index];
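    // Search from the opening quote itself so it can act as the preceding
    // non-backslash character; this also handles empty strings such as ''.
    this.index = this.findIndex(char === '"' ? /[^\\]"/ : /[^\\]'/, this.index) + 1;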
    return this.index;
  }

  addParam() {
    this.params.push(this.string.substring(this.startIndex, this.index > 0 ? this.index : this.string.length).trim());
  }
}

let run = string => new ParamSplitter(string).splitByParams();
let input = "1, 'str,ing', [1, 2, [3, 4, 5, 'str,ing']], 'st[rin,g][', 'text\\'moretext', ['two', ']', 'three'], 4";
console.log(run(input));
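
The same idea can also be written as a single character-by-character pass. The following is only a sketch under stated assumptions: `splitTopLevel` is a hypothetical helper name, a backslash inside a string escapes the next character, brackets inside strings are ignored, and an unmatched ] is silently skipped.

```javascript
// Hypothetical helper (not part of the class above): split on top-level commas only.
function splitTopLevel(text) {
  const parts = [];
  let depth = 0;    // current [ ] nesting depth outside of strings
  let quote = null; // the quote character we are inside, or null
  let start = 0;    // start index of the part being collected

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (quote) {
      if (ch === '\\') i++;                 // skip the escaped character
      else if (ch === quote) quote = null;  // closing quote
    } else if (ch === '"' || ch === "'") {
      quote = ch;
    } else if (ch === '[') {
      depth++;
    } else if (ch === ']') {
      if (depth > 0) depth--;               // assumption: ignore an unmatched ]
    } else if (ch === ',' && depth === 0) {
      parts.push(text.slice(start, i).trim());
      start = i + 1;
    }
  }
  parts.push(text.slice(start).trim());
  return parts;
}

const sketchInput = "1, 'str,ing', [1, 2, [3, 4, 5, 'str,ing']], 'st[rin,g][', 'text\\'moretext', ['two', ']', 'three'], 4";
console.log(splitTopLevel(sketchInput));
```

With an unterminated quote, everything after the opening quote ends up in the last part, which is one possible policy for malformed input rather than the only one.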
junvar