
This is quite a simple regex, but I can't get my head around how I'd expand this regex so that it would allow me to use my delimiter character as long as it is escaped in the string. Here's what I have:

// Contents of str is exactly '|1|2|\|Three and Four\||5'
str.match(/[^|]/);

// Looking for: ['1', '2', '|Three and Four|', '5']

So currently my regex selects everything that isn't a | character and I get an array of each item. But what I want to do is to ignore the | character as a separator if it has been escaped first with \, but of course I don't want the \ to come through.

I know this'll be marked as a duplicate of the billion other regex questions, but I've tried to apply other solutions on here to my own, and played around with regex101.com. Alas, my Regex Fu is not strong.

P.S. Does anyone know of any good resources for learning JS-flavoured regex?

user2864740
Hexodus
  • @charlietfl The `|` in that column are escaped (`\|`) and are thus *not* separators. Consider the variant (where the escaped separators aren't nestled up against the normal separators): `|1|hello\|happy\|world|2` -> `'1', 'hello|happy|world', '2'`. – user2864740 Apr 07 '18 at 23:17
  • `var str = '|1|2|\|Three and Four\||5';` is equal to `var str = '|1|2||Three and Four||5';` in JS – xianshenglu Apr 07 '18 at 23:19
  • @xianshenglu @user2864740 That's right, didn't think to make that clear. It's a data stream that I've cast to a string so that I can manipulate it and access each item between the `|`, but some of the items include `|` that shouldn't be treated as separators. – Hexodus Apr 07 '18 at 23:25
  • Possible duplicate of [RegEx needed to split javascript string on "|" but not "\|"](https://stackoverflow.com/questions/12754321/regex-needed-to-split-javascript-string-on-but-not) (there, found something :D) – user2864740 Apr 07 '18 at 23:46
  • Note: the duplicate still leaves in (actually asks for such) the \| -- this can be corrected by replacing the \| in resulting components after the split. – user2864740 Apr 07 '18 at 23:52

3 Answers


This should do it:

var str =  '|1|2|\\|Three and Four\\||5';
str.match(/((\\\|)|[^|])+/gi)

my output is this:

 ["1", "2", "\|Three and Four\|", "5"]

What I did was create a pattern that matches the \| sequence in the first sub-pattern and otherwise matches anything that is not a |. I escaped the \ in the string literal too, because otherwise JavaScript would parse \| as an escape sequence and drop the backslash.
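
The output above still contains the backslashes, while the question asks for `|Three and Four|`. A minimal sketch of one way to strip them after matching follows; `splitEscaped` and `input` are names made up for this illustration, and it assumes `\|` is the only escape sequence in the data.

```javascript
// String.raw keeps the backslashes exactly as typed.
const input = String.raw`|1|2|\|Three and Four\||5`;

// Hypothetical helper: match the fields, then drop the escape character.
function splitEscaped(s) {
  // Either an escaped pipe (\|) or any non-pipe character, one or more times.
  const fields = s.match(/(?:\\\||[^|])+/g) || [];
  // Turn every \| back into a bare |.
  return fields.map((f) => f.replace(/\\\|/g, '|'));
}

console.log(splitEscaped(input));
// -> ["1", "2", "|Three and Four|", "5"]
```

Because the pattern uses `+`, empty fields such as the one before the leading `|` are skipped.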

Paul G Mihai

If the JavaScript regex engine supports negative look-behinds (e.g. Chrome), and there is only the single, simple escape shown with no way to escape the escape character itself, a relatively simple negative look-behind works:

'|1|2|\\|Three and Four\\||5'.split(/(?<!\\)\|/)

// -> ["", "1", "2", "\|Three and Four\|", "5"]

This says to split on a "|" that is not preceded by a "\" (in an engine, such as Chrome's, that supports negative look-behinds).

Here is a method to convert a look-behind to a look-ahead for engine compatibility. Variations are also discussed in RegEx needed to split javascript string on "|" but not "\|".

However, as pointed out, the above doesn't touch the \| sequence and thus leaves in the escape sequence.
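
One way to remove the leftover escapes is to post-process each component after the split. This is only a sketch: `splitOnUnescaped` is a name invented here, and it assumes the engine supports look-behinds and that `\|` is the only escape in use.

```javascript
// Hypothetical helper: split on unescaped pipes, then unescape each piece.
function splitOnUnescaped(s) {
  return s
    .split(/(?<!\\)\|/)                     // split on | not preceded by \
    .map((p) => p.replace(/\\\|/g, '|'));   // turn \| into a bare |
}

console.log(splitOnUnescaped(String.raw`|1|2|\|Three and Four\||5`));
// -> ["", "1", "2", "|Three and Four|", "5"]
```

The leading empty string comes from the `|` at the start of the input; it can be dropped with `.slice(1)` if the data always starts with a separator.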


Alternatively, a multistep approach can solve this and also take care of the escape character as part of the process.

  1. Replace the escaped separators with an "alternate" character/string
  2. Split on the remaining (non-escaped) separators
  3. Convert the "alternate" character/string back in the individual components

In code,

const str = '|1|2|\\|Three and Four\\||5';

// replace \| -> "alternative" placeholder
// this assumes that \\| (escape-the-escape) is not allowed
const rep = str.replace(/\\[|]/g, '~~~~');

// split, then swap the placeholder back to a bare | (no escapes)
const res = rep.split('|').map(function (f) { return f.replace(/~~~~/g, '|'); });

// res -> ["", "1", "2", "|Three and Four|", "5"]
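
If an escape-the-escape sequence is needed, or a placeholder like `~~~~` could appear in the data, a single pass over the characters avoids both regex and placeholder. The sketch below is only an illustration: `splitWithEscapes` and its `sep`/`esc` parameters are hypothetical, and it assumes that the escape character followed by any character means "take that character literally".

```javascript
// Hypothetical helper: split on `sep`, treating `esc` + char as a literal char.
function splitWithEscapes(s, sep = '|', esc = '\\') {
  const fields = [];
  let current = '';
  for (let i = 0; i < s.length; i++) {
    const ch = s[i];
    if (ch === esc && i + 1 < s.length) {
      current += s[i + 1]; // keep the escaped character, drop the escape
      i++;                 // skip the character just consumed
    } else if (ch === sep) {
      fields.push(current); // unescaped separator ends the current field
      current = '';
    } else {
      current += ch;        // ordinary character (or a trailing lone escape)
    }
  }
  fields.push(current);     // last field, possibly empty
  return fields;
}

console.log(splitWithEscapes(String.raw`|1|2|\|Three and Four\||5`));
// -> ["", "1", "2", "|Three and Four|", "5"]
```

Under that assumption, an input of `a\\|b` (two backslashes, then a pipe) splits into `a\` and `b`, since the first backslash escapes the second and the pipe stays a separator.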
user2864740
  • Thanks for this. I think I might be getting a tad confused on the \ characters, I'm fairly new to JS. So the final string should contain exactly `|Three and Four|`, without any slashes. – Hexodus Apr 07 '18 at 23:47
  • @Jamie4840 Ahh yes. That would require a touch-up with the split usage shown as the original \| separator sequence is simply ignored. – user2864740 Apr 07 '18 at 23:49
  • Ah, gotcha! Understood. – Hexodus Apr 07 '18 at 23:54
  • The alternative solution still requires an escaping method for the new separator (`~~~~`), so it's basically shifting the problem to a new layer instead of solving it – etuardu Oct 07 '21 at 14:08

Paul G Mihai's answer works fine but does not capture empty strings: a||b|c would return [ "a", "b", "c" ], instead of [ "a", "", "b", "c" ] as one might want.

Building on his solution, here is a way to also get the empty strings, mimicking the behaviour of split():

str.match(
  /((\\\|)|[^\|])*/gi
).filter(
  (e, i, a) => !(i > 0 && e == "" && a[i-1] != "")
)

What I do here is use match() with the same pattern, but allowing zero-length matches (* instead of +).

This gives me an array of matches with an empty string element for each separator found and at the end of the string, e.g.: a|b|c would return [ "a", "", "b", "", "c", "" ].

Then I filter() it, discarding any empty string element that comes directly after a non-empty element, which gets rid of the unwanted items.

This seems to handle edge cases correctly as well:

a||b|c         → ["a", "", "b", "c"]
a|b|||c        → ["a", "b", "", "", "c"]
a|b\|b|c|      → ["a", "b\|b", "c", ""]
|a|\|b\||c|    → ["", "a", "\|b\|", "c", ""]
(empty string) → [""]
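
The results above still carry the `\|` escapes. As a sketch, the same match-and-filter steps can be wrapped in a function with one extra step that strips them; `splitKeepEmpty` is a hypothetical name, and the unescaping step assumes `\|` is the only escape sequence.

```javascript
// Hypothetical helper: split()-like behaviour that keeps empty fields
// and unescapes \| in each field.
function splitKeepEmpty(s) {
  return s
    .match(/(?:\\\||[^|])*/g) // zero-length matches mark the separators
    // Drop an empty match that directly follows a non-empty one.
    .filter((e, i, a) => !(i > 0 && e === '' && a[i - 1] !== ''))
    // Turn every \| back into a bare |.
    .map((f) => f.replace(/\\\|/g, '|'));
}

console.log(splitKeepEmpty('a||b|c'));
// -> ["a", "", "b", "c"]
console.log(splitKeepEmpty(String.raw`|a|\|b\||c|`));
// -> ["", "a", "|b|", "c", ""]
```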
etuardu