Regex and escaped and unescaped delimiter

Question

question related to this

I have a string

a\;b\\;c;d

which in Java looks like

String s = "a\\;b\\\\;c;d";

I need to split it by semicolon with the following rules:

If a semicolon is preceded by a backslash, it should not be treated as a separator (between a and b).
If the backslash itself is escaped and therefore does not escape the semicolon, that semicolon should be a separator (between b and c).

So a semicolon should be treated as a separator if it is preceded by zero or an even number of backslashes.

For the example above, I want to get the following strings (double the backslashes for the Java compiler):

a\;b\\
c
d
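
The zero-or-even rule can be checked directly by counting the backslashes immediately before a semicolon. A minimal sketch, assuming a hypothetical helper `isSeparator` (the class and method names are illustrative and not part of the question):

```java
// Hypothetical helper illustrating the rule from the question.
class SeparatorRule {
    /** True if the ';' at index is preceded by zero or an even number of backslashes. */
    static boolean isSeparator(String s, int index) {
        if (s.charAt(index) != ';') {
            return false;
        }
        int backslashes = 0;
        // Count the run of backslashes directly before the semicolon.
        for (int i = index - 1; i >= 0 && s.charAt(i) == '\\'; i--) {
            backslashes++;
        }
        return backslashes % 2 == 0;
    }

    public static void main(String[] args) {
        String s = "a\\;b\\\\;c;d"; // the raw text a\;b\\;c;d
        for (int i = 0; i < s.length(); i++) {
            if (isSeparator(s, i)) {
                System.out.println("separator at index " + i);
            }
        }
    }
}
```

For the example string this reports the semicolons at indexes 6 and 8 and skips the escaped one at index 2.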

I'm also not sure if regular expressions are the best tool for this task. But you chose to ignore my answer below ;-/ — hochl, Oct 26 '11 at 12:20

score 9 · Accepted Answer · answered Oct 26 '11 at 11:33

You can use the regex

(?:\\.|[^;\\]++)*

to match all text between unescaped semicolons:

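// Needs java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern.
List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");
Matcher regexMatcher = regex.matcher(subjectString);
// Each match is the text between two unescaped semicolons.
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}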

Explanation:

(?:        # Match either...
 \\.       # any escaped character
|          # or...
 [^;\\]++  # any character(s) except semicolon or backslash; possessive match
)*         # Repeat any number of times.

The possessive match (++) is important to avoid catastrophic backtracking because of the nested quantifiers.


Tim Pietzcker


It also returns empty strings, so I got `[a\;b\\, , c, , d, ]`. Is it possible somehow prevent it, except checking returned value of group()? – lstipakov Oct 26 '11 at 11:53
Yes, with a + instead of *, you get rid of the empty strings – Maurice Perry Oct 26 '11 at 11:56
Strange, it doesn't do this in my tests (in RegexBuddy, though). Well, if you don't want empty matches, change the `*` to `+`, but then you'll also not get "real" empty matches like in `a;;b`. – Tim Pietzcker Oct 26 '11 at 11:58
yep, real empty matches are fine. – lstipakov Oct 26 '11 at 12:16
FYI edge case: when last field ends with a escape char '\', or the input is just a sole escape char, the last escape char is lost, that is "a\" => ["a", "", ""]. The following seems to fix that edge case `"(?:\\\\(.|$)|[^;\\\\]++)*"` but not sure if creates another. My expression (so far) to also solve false empty fields but retain real empty fields is `"(?<=(?:^|;))(?:\\\\(?:.|$)|[^;\\\\]++)*"`. Thanks for neat idea. – AnyDev Aug 21 '20 at 04:43
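
One possible way to avoid the spurious empty matches discussed above while keeping real empty fields is to match one field at a time and then step over the delimiter explicitly. This is only a sketch using the accepted answer's pattern; the class name `FieldMatcher` and the method `split` are made up for illustration, and it assumes the input does not end in a lone backslash and has no backslash directly before a line break (those cases would need extra handling):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FieldMatcher {
    // Same field pattern as in the answer: escaped characters or runs of ordinary characters.
    private static final Pattern FIELD = Pattern.compile("(?:\\\\.|[^;\\\\]++)*");

    static List<String> split(String s) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(s);
        int pos = 0;
        while (true) {
            m.region(pos, s.length());
            m.lookingAt();          // always succeeds: the pattern can match the empty string
            fields.add(m.group());
            pos = m.end();
            if (pos >= s.length()) {
                break;
            }
            pos++;                  // step over the unescaped ';' that ended this field
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("a\\;b\\\\;c;d"));
        System.out.println(split("a;;b"));
    }
}
```

On the question's input this yields the three fields `a\;b\\`, `c` and `d`, and `a;;b` keeps its real empty field in the middle.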

hochl · Answer 2 · 2022-06-15T12:49:13.593

I do not trust regular expressions to detect such cases reliably. I usually write a simple loop for such things; I'll sketch it in C, since it has been ages since I last touched Java ;-)

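/* myString is assumed to be a NUL-terminated const char*; needs <stdio.h> and <string.h>. */
int i, len, state;
char c;

for (len = (int)strlen(myString), state = 0, i = 0; i < len; i++) {
    c = myString[i];
    if (state == 0) {
        if (c == '\\') {
            state++;                          /* the next character is escaped */
        } else if (c == ';') {
            printf("; at offset %d\n", i);    /* unescaped delimiter */
        }
    } else {
        state--;                              /* consume the escaped character */
    }
}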

The advantages are:

you can execute semantic actions on each step.
it's quite easy to port it to another language.
you don't need to include the complete regex library just for this simple task, which adds to portability.
it should be a lot faster than the regular expression matcher.

EDIT: I have added a complete C++ example for clarification.

#include <iostream>                                                             
#include <sstream>                                                              
#include <string>                                                               
#include <vector>                                                               
                                                                                
std::vector<std::string> unescapeString(const char* s)                        
{                                                                               
    std::vector<std::string> result;                                            
    std::stringstream ss;                                                       
    bool has_chars;                                                             
    int state;                                                                  
                                                                                
    for (has_chars = false, state = 0;;) {                                      
        auto c = *s++;                                                          
                                                                                
        if (state == 0) {                                                       
            if (!c) {                                                           
                if (has_chars) result.push_back(ss.str());                      
                break;                                                          
            } else if (c == '\\') {                                             
                ++state;                                                        
            } else if (c == ';') {                                              
                if (has_chars) {                                                
                    result.push_back(ss.str());                                 
                    has_chars = false;                                          
                    ss.str("");                                                 
                }                                                               
            } else {                                                            
                ss << c;                                                        
                has_chars = true;                                               
            }                                                                   
        } else /* if (state == 1) */ {                                          
            if (!c) {                                                           
                ss << '\\';                                                     
                result.push_back(ss.str());                                     
                break;                                                          
            }                                                                   
                                                                                
            ss << c;                                                            
            has_chars = true;                                                   
            --state;                                                            
        }                                                                       
    }                                                                           
                                                                                
    return result;                                                              
}                                                                               
                                                                                
int main(int argc, char* argv[])                                                
{                                                                               
    for (int i = 1; i < argc; ++i) {
        for (const auto& s: unescapeString(argv[i])) {                          
            std::cout << s << std::endl;                                        
        }                                                                       
    }                                                                           
}

I like this approach. However, it only solves half the problem. Once the delimiters are found, you need to (in addition to splitting) _unescape_ the individual parts. — aioobe, Jun 14 '22 at 19:41
That is not actually a problem since you can always execute code in each state. You can, for example, add a `putc(c)` in an additional `else` clause in `state 0` and to the `else` part in `state 1`. You can then add a `putc('\n')` to the `;` branch of `state 0` and a terminating one when the last character was read. Alternatively it would be possible to generate each string char-by-char and append the completed, unescaped string to a container with all strings found. I recommend using regex only if you are already using it for other purposes in the project, or if the problem is more complex. — hochl, Jun 15 '22 at 12:17
Right, it's not actually a problem in the sense that the solution was wrong, it was just incomplete. FWIW I deem this answer (especially now) to be the best one and the one I used to implement my solution. I think I came up with a slightly more condense version of your algorithm (it's in Java though): https://pastebin.com/FiSApkDJ I don't know if it can be translated easily to idiomatic C++ though. — aioobe, Jun 15 '22 at 18:43
It seems to do about the same, but I had to grok the missing `break;` in `case '\\':` to understand how you add escaped characters :D I unfortunately still haven't used Java since the early 2000s :/ that's why I posted a C++ answer ^^ — hochl, Jun 20 '22 at 09:57
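
Since the question is about Java, here is a rough Java rendering of the same two-state loop that splits and unescapes in one pass. It is a sketch, not a port of the pastebin code: the names `EscapedSplitter` and `splitAndUnescape` are invented here, and unlike the C++ example above it keeps empty fields rather than dropping them.

```java
import java.util.ArrayList;
import java.util.List;

class EscapedSplitter {
    static List<String> splitAndUnescape(String s) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean escaped = false;             // plays the role of state 1 in the loop above
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (escaped) {
                current.append(c);           // escaped character is taken literally
                escaped = false;
            } else if (c == '\\') {
                escaped = true;              // drop the backslash, remember the escape
            } else if (c == ';') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        if (escaped) {
            current.append('\\');            // keep a trailing lone backslash, as the C++ version does
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Raw input a\;b\\;c;d gives the unescaped fields a;b\ , c and d.
        System.out.println(splitAndUnescape("a\\;b\\\\;c;d"));
    }
}
```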

score 0 · Answer 3 · answered Jul 13 '18 at 13:29

I think this is the real answer. In my case I am trying to split on | and the escape character is &.

    final String regx = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";
    String[] res = "&|aa|aa|&|&&&|&&|s||||e|".split(regx);
    System.out.println(Arrays.toString(res));

In this code I am using a lookbehind to handle the & escape character. Note that the lookbehind must have a bounded maximum length.

(?<!((?:[^&]|^)(&&){0,10000}&))\\|

This matches any | except those preceded by ((?:[^&]|^)(&&){0,10000}&), which matches an odd number of &s. The part (?:[^&]|^) is important: it makes sure that all of the &s before the | are counted, back to the start of the string or to some other character.
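
A smaller input may make the behaviour easier to follow. The sketch below reuses the pattern from this answer unchanged; the class name `AmpersandEscapeSplit` and the sample string are just for illustration.

```java
import java.util.Arrays;

class AmpersandEscapeSplit {
    // Pattern from the answer: a '|' that is not preceded by an odd run of '&'.
    static final String SEPARATOR = "(?<!((?:[^&]|^)(&&){0,10000}&))\\|";

    static String[] split(String s) {
        return s.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // First '|' follows one '&' (escaped), second follows none, third follows two.
        System.out.println(Arrays.toString(split("a&|b|c&&|d")));
    }
}
```

Here the first | is escaped by a single &, so the fields should come out as `a&|b`, `c&&` and `d`. Keep in mind that `String.split` drops trailing empty strings unless a negative limit is passed.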

FailedDev · Answer 4 · 2011-10-26T11:44:49.003

String[] splitArray = subjectString.split("(?<!(?<!\\\\)\\\\);");

This should work.

Explanation :

// (?<!(?<!\\)\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\)\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\)»
//       Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»

So you just match the semicolons not preceded by exactly one \.
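
To see where this first pattern works and where it stops working, here is a small sketch that runs it on the question's input and on a string with three backslashes before the semicolon (the class name `LookbehindSplitDemo` is made up for the demo):

```java
import java.util.Arrays;

class LookbehindSplitDemo {
    // First pattern from this answer: ';' not preceded by exactly one backslash.
    static final String SEPARATOR = "(?<!(?<!\\\\)\\\\);";

    static String[] split(String s) {
        return s.split(SEPARATOR);
    }

    public static void main(String[] args) {
        // Raw input a\;b\\;c;d: one and two backslashes are handled as the question asks.
        System.out.println(Arrays.toString(split("a\\;b\\\\;c;d")));
        // Raw input a\\\;b;c: the ';' after three backslashes is escaped,
        // but the pattern still splits there.
        System.out.println(Arrays.toString(split("a\\\\\\;b;c")));
    }
}
```

The first call gives `a\;b\\`, `c`, `d` as wanted. The second splits into `a\\\`, `b`, `c`, although by the question's rule the first semicolon is escaped and the result should be `a\\\;b` and `c`.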

EDIT :

String[] splitArray = subjectString.split("(?<!(?<!\\\\(\\\\\\\\){0,2000000})\\\\);");

This will take care of any odd number of \. It will of course fail if you have more than 4000000 \ characters. Explanation of the edited answer:

// (?<!(?<!\\(\\\\){0,2000000})\\);
// 
// Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!(?<!\\(\\\\){0,2000000})\\)»
//    Assert that it is impossible to match the regex below with the match ending at this position (negative lookbehind) «(?<!\\(\\\\){0,2000000})»
//       Match the character “\” literally «\\»
//       Match the regular expression below and capture its match into backreference number 1 «(\\\\){0,2000000}»
//          Between zero and 2000000 times, as many times as possible, giving back as needed (greedy) «{0,2000000}»
//          Note: You repeated the capturing group itself.  The group will capture only the last iteration.  Put a capturing group around the repeated group to capture all iterations. «{0,2000000}»
//          Match the character “\” literally «\\»
//          Match the character “\” literally «\\»
//    Match the character “\” literally «\\»
// Match the character “;” literally «;»

This fails for `a\\\;b;c` and other cases with more than two backslashes. — Tim Pietzcker, Oct 26 '11 at 11:38
Can someone explain the downvoting? Unless I am missing something obvious? — FailedDev, Oct 26 '11 at 11:53
I don't know, it wasn't me. Perhaps nested backreferences are a bit too complicated? — Tim Pietzcker, Oct 26 '11 at 12:00
It wasn't me either, but don't look to me for an upvote. ;) That `{0,many}` hack is untrustworthy because Java's variable-width lookbehind support is notoriously buggy. But I wouldn't use this approach even in .NET, which imposes no restrictions at all on lookbehinds. A positive-matching approach like Tim's is more readable, more reliable, and much more portable (the possessive quantifiers are not essential). — Alan Moore, Oct 26 '11 at 12:04
@AlanMoore Agreed. This does not mean the solution is wrong though. — FailedDev, Oct 26 '11 at 12:07

score 0 · Answer 5 · answered Oct 26 '11 at 12:05

This approach assumes that your string does not contain the char '\0'. If it does, you can use some other char as the placeholder.

public static String[] split(String s) {
    String[] result = s.replaceAll("([^\\\\])\\\\;", "$1\0").split(";");
    for (int i = 0; i < result.length; i++) {
        result[i] = result[i].replaceAll("\0", "\\\\;");
    }
    return result;
}
