I need to write an extended version of the StringUtils.commaDelimitedListToStringArray function that takes an additional parameter: the escape character.

So calling my version:

commaDelimitedListToStringArray("test,test\\,test\\,test,test", "\\")

should return:

["test", "test,test,test", "test"]



My current attempt is to use String.split() to split the String using regular expressions:

String[] array = str.split("[^\\\\],");

But the returned array is:

["tes", "test\,test\,tes", "test"]

Any ideas?

arturh

6 Answers

The regular expression

[^\\],

means "match a character that is not a backslash, followed by a comma". This is why patterns such as t, match: t is a character that is not a backslash, and it gets consumed as part of the delimiter.

I think you need to use some sort of negative lookbehind, to capture a , which is not preceded by a \ without capturing the preceding character, something like

(?<!\\),

(BTW, note that I have purposefully not doubly-escaped the backslashes to make this more readable)
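
To see the difference concretely, here is a minimal Java sketch that runs both patterns on the string from the question. The class and method names are made up for illustration, and in Java source every backslash of the regex has to be doubled.

```java
import java.util.Arrays;

class LookbehindSplitDemo {
    // [^\\], consumes the character before the comma, so that character is lost.
    static String[] splitConsuming(String input) {
        return input.split("[^\\\\],");
    }

    // (?<!\\), is zero-width before the comma: only the comma itself is consumed.
    static String[] splitLookbehind(String input) {
        return input.split("(?<!\\\\),");
    }

    public static void main(String[] args) {
        // The actual characters are: test,test\,test\,test,test
        String input = "test,test\\,test\\,test,test";
        System.out.println(Arrays.toString(splitConsuming(input)));
        System.out.println(Arrays.toString(splitLookbehind(input)));
    }
}
```

The first call loses the t before each unescaped comma, while the second keeps the elements intact (the backslashes in front of the escaped commas are still there and would have to be removed in a separate step).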

matt b
  • This will still mishandle a string like "test\\,tost" (also intentionally not doubly escaped), which should be split into "test\\" and "tost". To overcome this, I once found this (Java) regex: "(?<=(?<!\\)(\\\\){0,100})," which is still not perfect (and still needs to be doubly escaped, i.e. "(?<=(?<!\\\\)(\\\\\\\\){0,100})," ). But it'll do – drvdijk Mar 10 '11 at 20:50
  • While this works, it does not yield the desired result: the backslash is retained. – Michael-O Mar 15 '13 at 11:59

Try:

String array[] = str.split("(?<!\\\\),");

Basically this says: split on a comma, except where that comma is preceded by a backslash (the four backslashes in the Java string literal match a single literal backslash in the text). This is called a zero-width negative lookbehind assertion.

cletus
  • That works quite well, thank you. The result is: ["test", "test\,test\,test", "test"] – arturh May 04 '09 at 13:59
  • Actually, it matches a comma preceded by ONE backslash. In a regex written as a Java String literal, it takes FOUR backslashes to match ONE in the target text. – Alan Moore May 05 '09 at 10:29
  • If you wanna remove the slashes too, use this: Arrays.stream(text.split("(?<!\\\\)\\s", -1)) .map(s -> s.replaceAll("(?<!\\\\)\\\\", "")).collect(Collectors.toList()) – user1079877 Nov 04 '17 at 14:13

For future reference, here is the complete method I ended up with:

public static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
    // these characters need to be escaped in a regular expression
    String regularExpressionSpecialChars = "/.*+?|()[]{}\\^$";

    String escapedEscapeChar = escapeChar;

    // if the escape char for our comma separated list needs to be escaped 
    // for the regular expression, escape it using the \ char
    if(regularExpressionSpecialChars.indexOf(escapeChar) != -1) 
        escapedEscapeChar = "\\" + escapeChar;

    // see http://stackoverflow.com/questions/820172/how-to-split-a-comma-separated-string-while-ignoring-escaped-commas
    String[] temp = str.split("(?<!" + escapedEscapeChar + "),", -1);

    // remove the escapeChar for the end result
    String[] result = new String[temp.length];
    for(int i=0; i<temp.length; i++) {
        result[i] = temp[i].replaceAll(escapedEscapeChar + ",", ",");
    }

    return result;
}
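
A shorter variant of the method above, sketched under the assumption that Pattern.quote() is acceptable for escaping the escape character instead of a hand-maintained list of special characters. The class name QuotedEscapeSplit is made up for illustration. Like the method above, it does not handle an escaped escape character.

```java
import java.util.regex.Pattern;

class QuotedEscapeSplit {
    static String[] commaDelimitedListToStringArray(String str, String escapeChar) {
        // Pattern.quote wraps the escape character in \Q...\E so that it is
        // always treated literally inside the lookbehind.
        String quoted = Pattern.quote(escapeChar);
        String[] parts = str.split("(?<!" + quoted + "),", -1);

        // String.replace works on literal text, so no regex escaping is needed here.
        for (int i = 0; i < parts.length; i++) {
            parts[i] = parts[i].replace(escapeChar + ",", ",");
        }
        return parts;
    }
}
```

Calling commaDelimitedListToStringArray("test,test\\,test\\,test,test", "\\") with this sketch yields the three elements test, test,test,test and test.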
arturh
  • Escaping doesn't need to be that difficult: String[] temp = str.split("(?<!\\Q" + escapeChar + "\\E),", -1); – Alan Moore May 05 '09 at 12:29
  • @AlanMoore You can also use the `Pattern.quote()` method to "escape" a string for safe inclusion in a regular expression. The method wraps the string in `\Q` and `\E`. – Michael Jun 24 '12 at 17:01
  • @Michael: That's true, and as a bonus it escapes any literal `\E` sequence in the original so it doesn't get treated as an ending delimiter. I never liked the `\Q..\E` feature in Perl, and I like it even less in Java, but that's no excuse for handing out bad advice about it. Cheers! – Alan Moore Jun 25 '12 at 00:10
  • @AlanMoore I think using `\Q..\E` (whether it be hard-coded or through `Pattern.quote()`) is better than defining a list of special characters and adding a backslash if the given character is in the list. – Michael Jun 25 '12 at 01:28

As matt b said, [^\\], will interpret the character preceding the comma as a part of the delimiter.

"test\\\\\\,test\\\\,test\\,test,test"
  -(split)->
["test\\\\\\,test\\\\,test\\,tes" , "test"]

As drvdijk said, (?<!\\), will misinterpret escaped backslashes.

"test\\\\\\,test\\\\,test\\,test,test"
  -(split)->
["test\\\\\\,test\\\\,test\\,test" , "test"]
  -(unescape commas)->
["test\\\\,test\\,test,test" , "test"]

I would expect to be able to escape backslashes as well:

"test\\\\\\,test\\\\,test\\,test,test"
  -(split)->
["test\\\\\\,test\\\\" , "test\\,test" , "test"]
  -(unescape commas and backslashes)->
["test\\,test\\" , "test,test" , "test"]

drvdijk suggested (?<=(?<!\\\\)(\\\\\\\\){0,100}), which works well for lists whose elements end with up to 100 escaped backslashes. That is plenty... but why have a limit at all? Is there a more efficient way (isn't lookbehind greedy)? And what about invalid strings?
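
For reference, this is how drvdijk's pattern can be tried on the example above. It is only a sketch: the class and constant names are made up, and as far as I know the explicit {0,100} bound is needed because Java's regex engine rejects lookbehind groups that have no obvious maximum length.

```java
class BoundedLookbehindSplit {
    // A comma is a delimiter when it is preceded by an even number of
    // backslashes (0 to 100 pairs), i.e. when it is not itself escaped.
    // Written as a Java string literal, hence the doubled backslashes.
    static final String DELIMITER = "(?<=(?<!\\\\)(\\\\\\\\){0,100}),";

    static String[] split(String list) {
        // limit -1 keeps trailing empty elements
        return list.split(DELIMITER, -1);
    }
}
```

This only splits; the elements still contain their escape sequences, and invalid input is not detected.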

I searched for a generic solution for a while, then wrote one myself. The idea is to iterate over a pattern that matches the list elements (instead of splitting on a pattern that matches the delimiter).

My answer does not take the escape character as a parameter.

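// Requires java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern
public static List<String> commaDelimitedListStringToStringList(String list) {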
    // Check the validity of the list
    // ex: "te\\st" is not valid, backslash should be escaped
    if (!list.matches("^(([^\\\\,]|\\\\,|\\\\\\\\)*(,|$))+")) {
        // Could also raise an exception
        return null;
    }
    // Matcher for the list elements
    Matcher matcher = Pattern
            .compile("(?<=(^|,))([^\\\\,]|\\\\,|\\\\\\\\)*(?=(,|$))")
            .matcher(list);
    ArrayList<String> result = new ArrayList<String>();
    while (matcher.find()) {
        // Unescape the list element
        result.add(matcher.group().replaceAll("\\\\([\\\\,])", "$1"));
    }
    return result;
}

Description of the pattern (unescaped):

(?<=(^|,)) lookbehind: the element is preceded by the start of the string or a ,

([^\\,]|\\,|\\\\)* the element itself, composed of \,, \\ or characters which are neither \ nor ,

(?=(,|$)) lookahead: the element is followed by the end of the string or a ,

The pattern may be simplified.

Even with three passes over the input (matches + find + replaceAll), this method seems faster than the one suggested by drvdijk. It could be optimized further by writing a dedicated parser.
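
Such a dedicated parser could look roughly like the following single-pass sketch. It is not benchmarked; the class and method names are made up, and it assumes the same rules as above: the backslash is the escape character, it may only precede a comma or another backslash, and invalid input yields null.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedListParser {
    static List<String> parse(String list) {
        List<String> result = new ArrayList<>();
        StringBuilder element = new StringBuilder();
        for (int i = 0; i < list.length(); i++) {
            char c = list.charAt(i);
            if (c == '\\') {
                // A backslash must be followed by a comma or another backslash
                if (i + 1 >= list.length()) {
                    return null;
                }
                char next = list.charAt(++i);
                if (next != ',' && next != '\\') {
                    return null;
                }
                // Unescape on the fly: keep only the escaped character
                element.append(next);
            } else if (c == ',') {
                // Unescaped comma: the current element is complete
                result.add(element.toString());
                element.setLength(0);
            } else {
                element.append(c);
            }
        }
        // The last element (possibly empty)
        result.add(element.toString());
        return result;
    }
}
```

Validation, splitting and unescaping happen in the same loop, so the input is read only once.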

Also, why bother with an escape character when only one character is special? The comma could simply be doubled:

public static List<String> commaDelimitedListStringToStringList2(String list) {
    if (!list.matches("^(([^,]|,,)*(,|$))+")) {
        return null;
    }
    Matcher matcher = Pattern.compile("(?<=(^|,))([^,]|,,)*(?=(,|$))")
                    .matcher(list);
    ArrayList<String> result = new ArrayList<String>();
    while (matcher.find()) {
        result.add(matcher.group().replaceAll(",,", ","));
    }
    return result;
}
boumbh

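In JavaScript, split(/(?<!\\),/g) worked for me, but the accepted answer did not: it is written as a Java string literal, and JavaScript's split treats a string argument as a literal separator rather than as a regex.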

> var x = "test,test\\,test\\,test,test"
undefined
> x.split(/(?<!\\),/g)
[ 'test', 'test\\,test\\,test', 'test' ]
> x.split("(?<!\\\\),")
[ 'test,test\\,test\\,test,test' ]
Alex

It's probably not a "super fancy" solution, but it is possibly a more time-efficient one. Escaping the escape character is also supported, and it works in browsers that do not support lookbehind.

function splitByDelimiterIfItIsNotEscaped (text, delimiter, escapeCharacter) {
    const parts = []
    // Length of the run of escape characters immediately before the current character
    let precedingEscapeCount = 0
    let partStartIndex = 0
    for (let index = 0; index < text.length; index++) {
        const character = text[index]
        if (character === escapeCharacter) {
            precedingEscapeCount++
            continue
        }
        // An even run (including zero) means the escape characters escape each other,
        // so the delimiter itself is not escaped
        if (character === delimiter && precedingEscapeCount % 2 === 0) {
            parts.push(text.substring(partStartIndex, index))
            partStartIndex = index + 1
        }
        precedingEscapeCount = 0
    }
    // The last part (possibly empty); escape characters are kept in the output
    parts.push(text.substring(partStartIndex))
    return parts
}

function onChange () {
    console.log(splitByDelimiterIfItIsNotEscaped(inputBox.value, ',', '\\'))
}

addEventListener('change', onChange)

onChange()
After making a change unfocus the input box (use tab for example).
<input id="inputBox" value="test,test\,test\,test,test"/>
kcpr