52

How can I split a string by a delimiter, but not if it is escaped? For example, I have a string:

1|2\|2|3\\|4\\\|4

The delimiter is | and an escaped delimiter is \|. Furthermore I want to ignore escaped backslashes, so in \\| the | would still be a delimiter.

So with the above string the result should be:

[0] => 1
[1] => 2\|2
[2] => 3\\
[3] => 4\\\|4
NikiC
Anton

5 Answers

107

Use dark magic:

$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

\\\\. matches a backslash followed by a character, (*SKIP)(*FAIL) skips it and \| matches your delimiter.
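As a quick check, the pattern can be run against the sample string from the question. This is only a minimal sketch: the variable names are illustrative, and the input is written as a single-quoted PHP literal, where each `\\` stands for one backslash.

```php
<?php
// Sample input from the question: 1|2\|2|3\\|4\\\|4
// In a single-quoted PHP literal, \\ denotes one literal backslash.
$string = '1|2\\|2|3\\\\|4\\\\\\|4';

// A backslash plus the character after it is consumed and then rejected as a
// match, so only a pipe that was not consumed that way acts as a split point.
$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

print_r($array);
```

The escapes stay in the resulting pieces (for example `2\|2`); removing them would be a separate step.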

NikiC
  • 6
    Is there documentation for `(*SKIP)(*FAIL)`? – eyelidlessness Jun 06 '11 at 06:10
  • 17
    @eyelidlessness: You can have a look into the [PCRE documentation](http://www.pcre.org/pcre.txt). Search for `(*SKIP)`. You'll find the documentation for all those backtracking control verbs like *SKIP, *FAIL, *ACCEPT, *PRUNE, ... there. – NikiC Jun 06 '11 at 13:27
  • 5
    +1 @NikiC both for providing a link to the PCRE documentation and making me want to read it. – Peter Oct 16 '11 at 13:35
  • I was trying to do it without backtracking control verbs. However, since it's not possible to use negative lookbehind assertions with a non-fixed length, that one becomes pretty hard. (Well unless you use Anton's solution below.) Guess you'd better do some magic, just like NikiC did. ;-) – MC Emperor Jun 27 '12 at 21:22
  • `+1` - amazing answer. But shouldn't the 3rd index of the array be `4\\\|4`? -- [It is `4` right now](http://3v4l.org/iDpMV) – Amal Murali Nov 10 '13 at 17:21
  • 2
    @AmalMurali `a\\\a` will be *two* backslashes after PHP parsed the string ;) `a\\\a` is the same as `a\\\\a` to PHP :) – NikiC Nov 10 '13 at 19:59
  • @MCEmperor You might look into the [`\K` escape sequence](http://stackoverflow.com/questions/13542950/support-of-k-in-regex). See [demo](http://regex101.com/r/qK1aK0). – HamZa Nov 13 '13 at 21:54
  • `(*SKIP)(*F)` is terrific... +1 :) – zx81 Jun 25 '14 at 23:51
  • @AmalMurali You will need to escape the input even more for the actual result to be correct. http://3v4l.org/VboWD – eisberg Jan 05 '15 at 08:33
11

Instead of split(...), it's IMO more intuitive to use some sort of "scan" function that operates like a lexical tokenizer. In PHP that would be the preg_match_all function. You simply say you want to match:

  1. something other than a \ or |
  2. or a \ followed by any character (which covers the escaped \\ and \|)
  3. repeat #1 or #2 at least once

The following demo:

$input = "1|2\\|2|3\\\\|4\\\\\\|4";
echo $input . "\n\n";
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
print_r($parts[0]);

will print:

1|2\|2|3\\|4\\\|4

Array
(
    [0] => 1
    [1] => 2\|2
    [2] => 3\\
    [3] => 4\\\|4
)
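One thing worth checking with a matching approach is how empty fields are handled. The sketch below uses a made-up input, `a||c`, and compares the pattern above with a `preg_split` call that uses the `(*SKIP)(*FAIL)` pattern. Because of the `+` quantifier, matching cannot produce an empty element, whereas splitting keeps it.

```php
<?php
$input = 'a||c'; // hypothetical input with an empty middle field

// Matching: "+" needs at least one character, so the empty field is dropped.
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
$matched = $parts[0];

// Splitting on unescaped pipes keeps the empty field.
$split = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $input);

print_r($matched); // elements: a, c
print_r($split);   // elements: a, (empty string), c
```

Which behaviour is wanted depends on whether empty fields carry meaning in the data.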
Bart Kiers
  • 1
    I always have weird edge cases when trying to match instead of split. For example, with `+`, you remove empty elements: `a||c`. With `*` you may get empty elements you don't really want (though not here, I think). Somehow, it never works quite the same... – Kobi Oct 03 '11 at 09:33
  • Sorry, my bad, I thought you were escaping the pipe in `[^\\|]`. I'm a little off. – Kobi Oct 03 '11 at 09:35
  • You raise a good point, Kobi. No, I wouldn't change the `+` into `*` (I guess too many empty strings would be matched). If the corner case of an empty string between pipes can occur, I'd handle it like this `...|(?<=^|\|)(?=$|\|)`, where `...` is the existing regex. – Bart Kiers Oct 03 '11 at 09:40
  • +1 This probably is the more intuitive way to solve this particular problem. But I found that the dark magic approach also has many other nice use cases (especially in quick & dirty regexing). For example if you want to scan some files but don't want to match in strings and comments you can simply SKIP those. (And there's obviously the problem with the empty elements, but Kobi already said that.) – NikiC Oct 04 '11 at 18:40
4

For future readers, here is a universal solution. It is based on NikiC's idea with (*SKIP)(*FAIL):

function split_escaped($delimiter, $escaper, $text)
{
    $d = preg_quote($delimiter, "~");
    $e = preg_quote($escaper, "~");
    $tokens = preg_split(
        '~' . $e . '(' . $e . '|' . $d . ')(*SKIP)(*FAIL)|' . $d . '~',
        $text
    );
    $escaperReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $escaper);
    $delimiterReplacement = str_replace(['\\', '$'], ['\\\\', '\\$'], $delimiter);
    return preg_replace(
        ['~' . $e . $e . '~', '~' . $e . $d . '~'],
        [$escaperReplacement, $delimiterReplacement],
        $tokens
    );
}

Try it out:

// the base situation:
$text = "asdf\\,fds\\,ddf,\\\\,f\\,,dd";
$delimiter = ",";
$escaper = "\\";
print_r(split_escaped($delimiter, $escaper, $text));

// other signs:
$text = "dk!%fj%slak!%df!!jlskj%%dfl%isr%!%%jlf";
$delimiter = "%";
$escaper = "!";
print_r(split_escaped($delimiter, $escaper, $text));

// delimiter with multiple characters:
$text = "aksd()jflaksd())jflkas(('()j()fkl'()()as()d('')jf";
$delimiter = "()";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));

// escaper is same as delimiter:
$text = "asfl''asjf'lkas'''jfkl''d'jsl";
$delimiter = "'";
$escaper = "'";
print_r(split_escaped($delimiter, $escaper, $text));

Output:

Array
(
    [0] => asdf,fds,ddf
    [1] => \
    [2] => f,
    [3] => dd
)
Array
(
    [0] => dk%fj
    [1] => slak%df!jlskj
    [2] => 
    [3] => dfl
    [4] => isr
    [5] => %
    [6] => jlf
)
Array
(
    [0] => aksd
    [1] => jflaksd
    [2] => )jflkas((()j
    [3] => fkl()
    [4] => as
    [5] => d(')jf
)
Array
(
    [0] => asfl'asjf
    [1] => lkas'
    [2] => jfkl'd
    [3] => jsl
)

Note: There is a problem at the theoretical level: implode('::', ['a:', ':b']) and implode('::', ['a', '', 'b']) produce the same string, 'a::::b'. Imploding is therefore an interesting problem in its own right.
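The same ambiguity shows up with a single-character delimiter whenever a part contains the delimiter itself. The sketch below reproduces it and shows one possible way around it: escaping each part before joining. `join_escaped` is a hypothetical helper, not part of the function above, and it assumes a single-character delimiter; the multi-character case from the note (a part ending in half a delimiter) is not handled by it.

```php
<?php
// Hypothetical helper: escape every escaper and delimiter inside each part
// before joining. Assumes a single-character delimiter.
function join_escaped(string $delimiter, string $escaper, array $parts): string
{
    $map = [
        $escaper   => $escaper . $escaper,
        $delimiter => $escaper . $delimiter,
    ];
    $escapedParts = array_map(
        function (string $part) use ($map): string {
            // strtr() replaces in a single pass, so inserted escapers are not re-escaped.
            return strtr($part, $map);
        },
        $parts
    );
    return implode($delimiter, $escapedParts);
}

// Plain implode() maps two different arrays to the same string:
var_dump(implode(',', ['a,', ',b']) === implode(',', ['a', '', '', 'b'])); // bool(true)

// With escaping, the two results differ:
echo join_escaped(',', '\\', ['a,', ',b']), "\n";       // a\,,\,b
echo join_escaped(',', '\\', ['a', '', '', 'b']), "\n"; // a,,,b
```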

Dávid Horváth
4

Recently I devised a solution:

$array = preg_split('~ ((?<!\\\\)|(?<=[^\\\\](\\\\\\\\)+)) \| ~x', $string);

But the black magic solution is still three times faster.
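No benchmark code accompanies this claim, so here is a rough sketch of how such a comparison could be set up. `time_it`, the test input and the iteration count are all assumptions made for illustration; the two candidates are the `(*SKIP)(*FAIL)` split and the `preg_match_all` pattern from the other answers, and any further pattern can be added as another entry. Results will vary with PHP version, PCRE build and input.

```php
<?php
// Hypothetical helper: wall-clock seconds taken by $iterations calls of $fn.
function time_it(callable $fn, int $iterations): float
{
    $start = hrtime(true); // monotonic clock in nanoseconds (PHP 7.3+)
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / 1e9;
}

// Made-up test input: the question's sample string repeated many times.
$string = str_repeat('1|2\\|2|3\\\\|4\\\\\\|4|', 1000);

$candidates = [
    'skip/fail' => function () use ($string) {
        return preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);
    },
    'match_all' => function () use ($string) {
        preg_match_all('/(?:\\\\.|[^\\\\|])+/', $string, $parts);
        return $parts[0];
    },
];

foreach ($candidates as $name => $fn) {
    printf("%-10s %.4f s\n", $name, time_it($fn, 200));
}
```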

Alix Axel
Anton
  • Would you mind sharing how you measured the performance of your regex vs. that of the black magic solution? – Peter Oct 15 '11 at 23:03
-1

Regex can be painfully slow. An alternative is to replace the escaped characters with placeholder markers before splitting, then put them back afterwards:

$foo = 'a,b|,c,d||,e';

function splitEscaped($str, $delimiter, $escapeChar = '\\') {
    // Temporary markers, assumed not to occur in the original string
    $double = "\0\0\0_doub";
    $escaped = "\0\0\0_esc";
    // Hide escaped escape characters first, then escaped delimiters
    $str = str_replace($escapeChar . $escapeChar, $double, $str);
    $str = str_replace($escapeChar . $delimiter, $escaped, $str);

    // Only unescaped delimiters are left to split on
    $split = explode($delimiter, $str);
    foreach ($split as &$val) {
        // Turn the markers back into the unescaped characters
        $val = str_replace([$double, $escaped], [$escapeChar, $delimiter], $val);
    }
    unset($val); // break the reference left over from the loop
    return $split;
}

print_r(splitEscaped($foo, ',', '|'));

which splits on ',' but not if escaped with "|". It also supports double escaping so "||" becomes a single "|" after the split happens:

Array
(
    [0] => a
    [1] => b,c
    [2] => d|
    [3] => e
)
Tom B