
I am totally new to regular expressions. I'm trying to put together an expression that splits the example string on every space that is neither surrounded by single or double quotes nor preceded by a '\'.

E.g.:

He is a "man of his" words\ always

must be split as

He
is 
a 
"man of his"
words\ always

I understand that

List<String> matchList = new ArrayList<String>();
Pattern regex = Pattern.compile("[^\\s\"']+|\"[^\"]*\"|'[^']*'");
Matcher regexMatcher = regex.matcher(StringToBeMatched);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group());
}

will split the example string on all spaces that are not surrounded by single or double quotes.

How do I incorporate the third condition of ignoring the whitespace if it is preceded by a \?

Pshemo
Sriram Manohar
3 Answers


You can use this regex:

((["']).*?\2|(?:[^\\ ]+\\\s+)+[^\\ ]+|\S+)


In Java:

Pattern regex = Pattern.compile ( 
"(([\"']).*?\\2|(?:[^\\\\ ]+\\\\\\s+)+[^\\\\ ]+|\\S+)" );

Explanation:

This regex works on alternation:

  1. First, try (["']).*?\2 to match any quoted string (double or single quotes); in the full regex, \2 refers back to whichever opening quote was matched.
  2. Then, try (?:[^\\ ]+\\\s+)+[^\\ ]+ to match any string containing escaped spaces.
  3. Finally, use \S+ to match any word with no spaces.
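
The three alternatives above can be tried out with a small harness. This is only a sketch: the class name QuotedOrEscapedTokens and the helper tokenize are hypothetical names introduced for illustration, and the input is the example sentence from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class QuotedOrEscapedTokens {
    // The pattern from this answer:
    // quoted string | run of words joined by escaped spaces | plain word
    static final Pattern TOKEN = Pattern.compile(
        "(([\"']).*?\\2|(?:[^\\\\ ]+\\\\\\s+)+[^\\\\ ]+|\\S+)");

    // Collects every match of TOKEN in the input, in order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Java source for the text: He is a "man of his" words\ always
        for (String token : tokenize("He is a \"man of his\" words\\ always")) {
            System.out.println(token);
        }
    }
}
```

Each token is printed on its own line, so the quoted phrase and the escaped-space phrase should each appear as a single line.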
anubhava
  • Thanks anubhava. Can you please explain the expression? Also, it doesn't use single quotes? – Sriram Manohar Dec 22 '14 at 18:25
  • Cool solution... A bit heavy on the backtracking though. – Edward J Beckett Dec 22 '14 at 18:40
  • Thanks. So for both single and double quotes to be escaped, it must be Pattern regex = Pattern.compile( "(\"[^\"]*\"|'[^']*'|\\S+?(?:\\\\\\s+\\S*)+|\\S+)" ); right? – Sriram Manohar Dec 22 '14 at 18:44
  • I am sorry. I was using a crappy mobile and had unaccepted it by mistake while navigating to another page. – Sriram Manohar Dec 31 '14 at 14:12
  • @anubhava I wasn't criticizing you about the backtracking. It was just a performance note. Moreover, I wasn't able to get past the backtracking either ;) Happy New Year. – Edward J Beckett Dec 31 '14 at 22:21
  • Absolutely no issues @EddieB, criticism is always taken in the right spirit. I don't want to complicate the regex unless it is required to process a huge amount of data. Wish you a very happy new year as well. – anubhava Jan 01 '15 at 03:28
  • The quoting doesn't work for me on JDK 11, using the exact example you gave: `Pattern regex = Pattern.compile( "(([\"']).*?\2|(?:[^\\\\ ]+\\\\\s+)+[^\\\\ ]+|\\S+)" );` -- if quoted text contains spaces, the spaces become delimiters. – Luke Hutchison Nov 22 '20 at 06:35
  • @LukeHutchison: Please try: `Pattern regex = Pattern.compile( "(([\"']).*?\\2|(?:[^\\\\ ]+\\\\\\s+)+[^\\\\ ]+|\\S+)" );` – anubhava Nov 22 '20 at 06:38
  • @anubhava That works, but only if there's a space before the first quote and after the last quote. Otherwise the quotes don't work as delimiters. My intuition would be that in the "correct" regexp, there would be an implicit space before the opening quote and after the closing quote, whether or not there was an explicit space, so that opening quotes start a new token, even if not preceded by a space. – Luke Hutchison Nov 23 '20 at 09:22
  • @LukeHutchison: Would you mind posting a new question with examples – anubhava Nov 23 '20 at 09:24

Anubhava's solution is nice. I particularly like his use of \S+. My solution is similar in its groupings, except that the third alternative captures on beginning and ending word boundaries.

RegEx

(?i)((?:(['|"]).+\2)|(?:\w+\\\s\w+)+|\b(?=\w)\w+\b(?!\w))

For Java

(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))

Example

String subject = "He is a \"man of his\" words\\ always 'and forever'";
Pattern pattern = Pattern.compile( "(?i)((?:(['|\"]).+\\2)|(?:\\w+\\\\\\s\\w+)+|\\b(?=\\w)\\w+\\b(?!\\w))" );
Matcher matcher = pattern.matcher( subject );
while( matcher.find() ) {
    System.out.println( matcher.group() ); // print the matched token as-is
}

Result

He
is
a
"man of his"
words\ always
'and forever'

Detailed Explanation

"(?i)" +                 // Match the remainder of the regex with the options: case insensitive (i)
"(" +                    // Match the regular expression below and capture its match into backreference number 1
                            // Match either the regular expression below (attempting the next alternative only if this one fails)
      "(?:" +                  // Match the regular expression below
         "(" +                    // Match the regular expression below and capture its match into backreference number 2
            "['|\"]" +                // Match a single character present in the list “'|"”
         ")" +
         "." +                    // Match any single character that is not a line break character
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         "\\2" +                   // Match the same text as most recently matched by capturing group number 2
      ")" +
   "|" +                    // Or match regular expression number 2 below (attempting the next alternative only if this one fails)
      "(?:" +                  // Match the regular expression below
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
         "\\\\" +                   // Match the character “\” literally
         "\\s" +                   // Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
            "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      ")+" +                   // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
   "|" +                    // Or match regular expression number 3 below (the entire group fails if this one fails to match)
      "\\b" +                   // Assert position at a word boundary
      "(?=" +                  // Assert that the regex below can be matched, starting at this position (positive lookahead)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
      ")" +
      "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
         "+" +                    // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
      "\\b" +                   // Assert position at a word boundary
      "(?!" +                  // Assert that it is impossible to match the regex below starting at this position (negative lookahead)
         "\\w" +                   // Match a single character that is a “word character” (letters, digits, etc.)
      ")" +
")"  
Edward J Beckett

A regex for a \ followed by whitespace can be written as \\\s, where \\ matches a literal \ and \s matches any whitespace character. As a Java string literal, this regex must be written as "\\\\\\s", because each \ in a string literal has to be escaped with another \.
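
As a quick check of the escaping described above, the sketch below compiles the string literal and prints the regex the engine actually sees. The class name EscapedWhitespaceDemo and the helper countEscapedWhitespace are hypothetical, introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class EscapedWhitespaceDemo {
    // The Java literal "\\\\\\s" holds the four-character regex \\\s:
    // an escaped backslash followed by the whitespace class \s.
    static final Pattern ESCAPED_WS = Pattern.compile("\\\\\\s");

    // Counts occurrences of a backslash immediately followed by whitespace.
    static int countEscapedWhitespace(String input) {
        Matcher m = ESCAPED_WS.matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Shows the regex after the Java compiler has processed the literal.
        System.out.println(ESCAPED_WS.pattern());
        // Java source for the text: words\ always and more
        System.out.println(countEscapedWhitespace("words\\ always and more"));
    }
}
```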

So now we want our pattern to find

  • "..." -> "[^"]*"
  • or '...' -> '[^']*'
  • or non-whitespace characters (\S), also including any whitespace that has a \ before it (\\\s). This one is a little tricky because \S can also consume the \ placed before a space, which would prevent \\\s from ever matching. That is why we want the regex engine to

    • first try \\\s
    • and only then \S.

    So instead of something like (\S|\\\s)+ we need to write this part of the regex as (\\\s|\S)+. The regex engine tries the alternatives separated by | from left to right: for instance, with a regex like a|ab, ab will never be matched because a is consumed by the left alternative.
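
The effect of alternative order can be seen directly with a short experiment. This is a sketch only; the class name AlternationOrderDemo and the helper firstMatch are hypothetical names used for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class AlternationOrderDemo {
    // Returns the first match of the regex in the input, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The left alternative wins as soon as it matches.
        System.out.println(firstMatch("a|ab", "ab"));   // a
        System.out.println(firstMatch("ab|a", "ab"));   // ab

        // Java source for the text: words\ always
        String text = "words\\ always";
        // \S first: the backslash is consumed as a plain non-whitespace
        // character, so the match stops at the space.
        System.out.println(firstMatch("(\\S|\\\\\\s)+", text));   // words\
        // \\\s first: backslash plus space is consumed as one unit.
        System.out.println(firstMatch("(\\\\\\s|\\S)+", text));   // words\ always
    }
}
```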

So your pattern can look like

Pattern regex = Pattern.compile("\"[^\"]*\"|'[^']*'|(\\\\\\s|\\S)+");
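
A minimal sketch of using this pattern with the matching loop from the question follows. The class name EscapedSpaceTokenizer and the method tokenize are hypothetical, and the trailing 'and forever' in the sample input is an assumption added to exercise the single-quote branch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original answer.
class EscapedSpaceTokenizer {
    // "..." | '...' | run of (escaped whitespace or non-whitespace) characters
    static final Pattern TOKEN =
        Pattern.compile("\"[^\"]*\"|'[^']*'|(\\\\\\s|\\S)+");

    // Collects every match of TOKEN in the input, in order.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Java source for the text: He is a "man of his" words\ always 'and forever'
        String input = "He is a \"man of his\" words\\ always 'and forever'";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Note that the quotes and the backslash stay in the returned tokens; stripping them, if needed, would be a separate step.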
Pshemo