I need to split a string. I have a regex able to match each substring entirely.

I tried using it with String.prototype.matchAll(), and it does split the string, but it also silently skips over "invalid tokens": pieces of the string that don't match my regex. For instance:

var re = /\s*(\w+|"[^"]*")\s*/g  // matches a word or a quoted string
var str = 'hey ??? "a"b'         // the '???' part is not a valid token
var match = str.matchAll(re)
for(var m of match){
  console.log("Matched:", m[1])
}

This gives me the tokens hey, "a" and b. Those are indeed the substrings that match my regex, but I wanted an error in this case, since the string contains ???, which is not a valid token.

How can I do this?

Blue Nebula
  • is the space important to the match if you are accepting `*`? The match will not throw an error, it will find all occurrences of your grouping. if you want to validate your string by a regular expression, you are probably looking for `re.test(str)` – async await Sep 30 '21 at 15:59
  • @asyncawait: the spaces that separate two tokens are optional. I don't really care about matching those (and in fact don't capture them), but the regex contains them because it's meant to match the whole string entirely in sequential steps, without skipping any character. I'm not sure how to use `re.test(str)` in this case... Unless you're suggesting to build a new regex that matches the given one N times (`/^(\s*(\w+|"[^"]*")\s*)*$/` for the example)... It seems a bit of a pain to build such regex, so I'm wondering if other solutions exist? – Blue Nebula Sep 30 '21 at 16:03
  • The problem with `test` is that it passes if any part of the string matches. You could use `.replace()` with the regular expression and the global flag, replacing each match with an empty string; if the result still has length, you know you have invalid characters. If it's a big string, you could build an expression for the invalid characters and test for them. – async await Sep 30 '21 at 16:08
  • I can't easily create a regex for invalid character: in the real case it's not just about characters, but there's a bit of context involved; I can handle it with matching regex, but not with "invalid matching" ones. The idea of replacing everything that matches my token with the empty string and checking the final length is a good one. I'll go with it if nothing better can be done – Blue Nebula Sep 30 '21 at 16:10
  • `const isValid = (str.match(re).length === str.split(re).filter(s => s !== '').length)` – Peter Seliger Sep 30 '21 at 19:09

1 Answer

The /\s*(\w+|"[^"]*")\s*/g regex is used to extract multiple pattern matches from a string. It is not meant to validate a string.

If you need to return true or false, you need a separate validation regex: one anchored at both ends (^ and $) so that the whole string must consist of nothing but valid tokens and optional whitespace.

So, in your case, use the two-step approach:

  • Validate the string with /^\s*(?:(?:\w+|"[^"]*")\s*)*$/.test(text) first, and then
  • If the string is valid, extract the tokens using your code, or a slightly simpler version: const matches = text.match(/\w+|"[^"]*"/g).

See the JavaScript demo:

var extraction_re = /\w+|"[^"]*"/g;
var validation_re = /^\s*(?:(?:\w+|"[^"]*")\s*)*$/;
for (var text of ['hey "a"b', 'hey ??? "a"b']) {
    if (validation_re.test(text)) {
        console.log("Matched:", text.match(extraction_re))
    } else {
        console.log(text, "=> No Match!")
    }
}
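
If you would rather get an error in a single pass, as the question asks, one possible alternative to the two-step approach is the sticky flag (y), which only lets the regex match exactly at lastIndex, so nothing can be skipped. This is just a sketch under some assumptions: the tokenize helper and its error message are made up for illustration, and the token pattern is the one from the question with the leading \s* handled separately.

```javascript
// Hypothetical helper (not part of the demo above): tokenizes `text` in one
// pass and throws as soon as it reaches input that is not a valid token.
function tokenize(text) {
  // The sticky flag `y` forces each exec() to match exactly at `lastIndex`,
  // so the regex cannot silently jump over invalid characters.
  const tokenRe = /(\w+|"[^"]*")\s*/y;
  const tokens = [];

  // Skip leading whitespace: index of the first non-space char, or end of string.
  let pos = text.search(/\S|$/);

  while (pos < text.length) {
    tokenRe.lastIndex = pos;
    const m = tokenRe.exec(text);
    if (m === null) {
      throw new SyntaxError(
        `Invalid token at index ${pos}: ${JSON.stringify(text.slice(pos))}`
      );
    }
    tokens.push(m[1]);
    // Continue right after the token and its trailing whitespace.
    pos = tokenRe.lastIndex;
  }
  return tokens;
}
```

With this sketch, tokenize('hey "a"b') returns the three tokens hey, "a" and b, while tokenize('hey ??? "a"b') throws a SyntaxError pointing at index 4, where the ??? starts.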
Wiktor Stribiżew