I am working on a project in Java that requires having nested strings.

For an input string that in plain text looks like this:

This is "a string" and this is "a \"nested\" string"

The result must be the following:

[0] This
[1] is
[2] "a string"
[3] and
[4] this
[5] is
[6] "a \"nested\" string"

Note that I want the \" sequences to be kept.
I have the following method:

public static String[] splitKeepingQuotationMarks(String s);

and I need to create an array of strings out of the given s parameter by the given rules, without using the Java Collection Framework or its derivatives.

I am unsure about how to solve this problem.
Can a regular expression solve this?

UPDATE based on questions from comments:

  • each unescaped " has its closing unescaped " (they are balanced)
  • each escape character \ must itself be escaped to produce a literal backslash (text representing \ is written as \\).
bobasti
  • @Turtle: Not always. It will split the `nested` string too. –  Mar 29 '16 at 18:45
  • even if you split on a space? – Turtle Mar 29 '16 at 18:47
  • That isn't a regular language. You need features beyond ordinary regular expressions. Look-arounds extend regex to beyond regular languages, but since this sounds like a school assignment, the goal might be to get you to write a lexer (lexical analyzer). – jpmc26 Mar 29 '16 at 18:48
  • That is exactly what I am doing - writing a Lexer. – bobasti Mar 29 '16 at 18:50
  • Possible duplicate of [Parse string with whitespace and quotation mark (with quotation mark retained)](http://stackoverflow.com/questions/34607051/parse-string-with-whitespace-and-quotation-mark-with-quotation-mark-retained) – rpy Mar 29 '16 at 18:56
  • I don't think so - that question doesn't mention nested strings. – bobasti Mar 29 '16 at 19:05
  • Can we assume that string is always balanced? Like each `"` has its proper closing `"`? – Pshemo Mar 29 '16 at 19:09
  • Yes. At least the most outer string. – bobasti Mar 29 '16 at 19:13
  • Do the quotes inside a nested quote have the `\"` or is it a plain `"`? – Matthew Wright Mar 29 '16 at 19:23
  • Is it possible that inside quote text will end with ``\``? I mean, what if we want to quote ``path = dir1\dir2\``? If I write `"path = dir1\dir2\"` then last `\"` would represent escaped `"` which will prevent quote from being properly closed here. Can we assume that each ``\`` (at least inside quotes) also requires escaping with another ``\``? – Pshemo Mar 29 '16 at 20:07
  • The nested quotes must have `\"` (with an escape-character). Also, the escape-character itself must be escaped. – bobasti Mar 29 '16 at 20:16
  • Does it mean you need a `String[]` variable at the end? – Wiktor Stribiżew Mar 29 '16 at 20:19
  • Yes. This would be trivial if I am allowed to use the _Java Collection Framework_, but this is not the case. Thank you for the updated answer. – bobasti Mar 29 '16 at 20:29

3 Answers

You can use the following regex:

"[^"\\]*(?:\\.[^"\\]*)*"|\S+

See the regex demo

Java demo:

String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\""; 
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
while (matcher.find()) {
    System.out.println(matcher.group(0));
}

Explanation:

  • "[^"\\]*(?:\\.[^"\\]*)*" - a double quote, followed by 0+ characters other than " and \ ([^"\\]*), followed by 0+ repetitions of an escape sequence (\\.) and 0+ characters other than " and \, followed by the closing double quote
  • | - or...
  • \S+ - 1 or more non-whitespace characters

NOTE

@Pshemo's suggestion - "\"(?:\\\\.|[^\"])*\"|\\S+" (or "\"(?:\\\\.|[^\"\\\\])*\"|\\S+", which would be more correct) - is essentially the same expression, but much less efficient since it uses an alternation group quantified with *. This construct involves much more backtracking because the regex engine has to test two possibilities at each position. My unroll-the-loop version matches chunks of text at once, and is thus much faster and more reliable.
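To check that the two patterns from the NOTE agree on the sample input, something like the following sketch can be used. The class name PatternComparison and the helper tokens are made up for illustration, and the check only compares the matches; it says nothing about speed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original answer
class PatternComparison {
    // Unrolled pattern from the answer
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
    // Alternation-based pattern from the NOTE (the stricter variant)
    static final Pattern ALTERNATION =
            Pattern.compile("\"(?:\\\\.|[^\"\\\\])*\"|\\S+");

    // Joins all matches with newlines so results can be compared as plain strings
    static String tokens(Pattern p, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = p.matcher(input);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        String a = tokens(UNROLLED, input);
        String b = tokens(ALTERNATION, input);
        System.out.println(a);
        System.out.println("Same matches: " + a.equals(b));
    }
}
```

On inputs that follow the rules from the question (balanced quotes, escaped backslashes) the two patterns should produce the same tokens; the difference the NOTE describes is in how the engine gets there.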

UPDATE

Since a String[] is required as output, you need two passes: count the matches, create the array, and then run the matcher again to fill it:

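import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
...
String str = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
Pattern ptrn = Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");
Matcher matcher = ptrn.matcher(str);
// Pass 1: count the matches so the array can be sized
int cnt = 0;
while (matcher.find()) {
    cnt++;
}
System.out.println(cnt);
// Pass 2: rewind the matcher and fill the array
String[] result = new String[cnt];
matcher.reset();
int idx = 0;
while (matcher.find()) {
    result[idx] = matcher.group(0);
    idx++;
}
System.out.println(Arrays.toString(result)); // Arrays is used only for printing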

See another IDEONE demo
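The count-then-fill idea can also be placed behind the method signature from the question. This is only a sketch: the class name RegexSplitter and the constant TOKEN are made up here, and it assumes the input follows the rules from the question's update (balanced quotes, escaped backslashes).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper class; only the method signature comes from the question
class RegexSplitter {
    private static final Pattern TOKEN =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    public static String[] splitKeepingQuotationMarks(String s) {
        Matcher matcher = TOKEN.matcher(s);

        // Pass 1: count the matches so the array can be sized exactly
        int count = 0;
        while (matcher.find()) {
            count++;
        }

        // Pass 2: rewind the same matcher and copy each match into the array
        String[] result = new String[count];
        matcher.reset();
        for (int i = 0; i < count && matcher.find(); i++) {
            result[i] = matcher.group();
        }
        return result;
    }

    public static void main(String[] args) {
        String[] parts = splitKeepingQuotationMarks(
                "This is \"a string\" and this is \"a \\\"nested\\\" string\"");
        for (int i = 0; i < parts.length; i++) {
            System.out.println("[" + i + "] " + parts[i]);
        }
    }
}
```

No collection type is involved; the cost is that the input is scanned twice.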

Wiktor Stribiżew
  • WTF ...! How did you do that ..! +1 – Shafizadeh Mar 29 '16 at 18:51
  • @Shafizadeh I added the explanation, and now giving the comp to my nagging wife :) – Wiktor Stribiżew Mar 29 '16 at 18:55
  • `Pattern.compile("\"(?:\\\\.|[^\"])*\"|\\S+");` should probably also work. – Pshemo Mar 29 '16 at 19:24
  • @Pshemo: I think that unroll-the-loop based regex is much more efficient than a non-unrolled one. Just compare them at regex101.com, and you will feel the difference. You may even test in Java. I am sure my version is much less stack overflow error prone than your version with alternation group. – Wiktor Stribiżew Mar 29 '16 at 19:25
  • @WiktorStribiżew Interesting. I should probably read some book about regex and its optimization. – Pshemo Mar 29 '16 at 19:32
  • How do you get from this regex to an array? – erickson Mar 29 '16 at 19:41
  • @erickson: It is easy: just add a `List` and add matches into it... I have added a code example. – Wiktor Stribiżew Mar 29 '16 at 19:42
  • @WiktorStribiżew `List` is not allowed: *"without using the Java Collection Framework or its derivatives."* – erickson Mar 29 '16 at 19:44
  • @erickson: **You** asked for a list as an output. – Wiktor Stribiżew Mar 29 '16 at 19:45
  • No I didn't. What are you talking about? – erickson Mar 29 '16 at 19:46
  • Then I do not get you. – Wiktor Stribiżew Mar 29 '16 at 19:46
  • Where did I lose you? The OP asked for an array result, and his constraints don't permit use of the Java Collections Framework. You've provided a partial answer, and my initial comment was intended to challenge you to address this aspect as well. – erickson Mar 29 '16 at 19:49
  • Thank you for the answer and a step-by-step explanation of the regex. This works flawlessly. Sorry I caused this confusion about the _Java Collection Framework_, this is because I edited the question some time after this answer was posted. I created an array by counting the `matcher.find()` **true** results, did a `matcher.reset();`, and then doing the `matcher.find()` and `matcher.group(0)` in a _for-loop_ and adding these results to the array. This may not be very efficient, but I don't see another way to create an array. – bobasti Mar 29 '16 at 20:21
  • I have been working on the implementation of the same idea. I updated the answer with [**this demo**](http://ideone.com/F212Qz) – Wiktor Stribiżew Mar 29 '16 at 20:25
  • Thank you. This is the same as what I made a few moments ago. Marked as accepted answer. – bobasti Mar 29 '16 at 20:30
  • Oracle [proposes](http://www.theregister.co.uk/2015/05/13/oracle_proposes_to_deliver_of_java_9_sdk_on_september_22nd_2016/) to deliver Java 9 SDK on September 22nd, 2016, and there is going to be a [`Stream<MatchResult> results()`](http://download.java.net/jdk9/docs/api/java/util/regex/Matcher.html#results--) method that might be handy for obtaining the count of matches and matches themselves. Not sure a stream can be used in this scenario though. – Wiktor Stribiżew Mar 29 '16 at 20:36
  • [One more attempt to get rid of two matchers](http://ideone.com/zQBT4c). Not tested on more strings, just added spaces at the start and end. – Wiktor Stribiżew Mar 29 '16 at 21:15

Another regex approach that works uses a negative lookbehind: "words" (\w+) OR "quote followed by anything up to the next quote that ISN'T preceded by a backslash", and set your match to "global" (don't return on first match)

(\w+|".*?(?<!\\)")

see it here.
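The limits of this pattern show up on input that contains a non-word character such as = or an escaped backslash right before a closing quote. The sketch below runs it next to the unrolled pattern from the other answer; the class name LookbehindCheck, the helper tokens, and the test input are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison class, not part of either answer
class LookbehindCheck {
    // Pattern from this answer
    static final Pattern LOOKBEHIND =
            Pattern.compile("(\\w+|\".*?(?<!\\\\)\")");
    // Unrolled pattern from the other answer, for comparison
    static final Pattern UNROLLED =
            Pattern.compile("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"|\\S+");

    // Joins all matches with newlines
    static String tokens(Pattern p, String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = p.matcher(input);
        while (m.find()) {
            if (sb.length() > 0) {
                sb.append('\n');
            }
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Plain text: path = "dir\\" x
        String input = "path = \"dir\\\\\" x";
        System.out.println("lookbehind:");
        System.out.println(tokens(LOOKBEHIND, input));
        System.out.println("unrolled:");
        System.out.println(tokens(UNROLLED, input));
    }
}
```

With this input the lookbehind pattern drops the = (it is not matched by \w+) and never finds a closing quote, because the real closing " is preceded by a backslash, so the quoted token is broken apart.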

Scott Weaver
  • That's a nice pattern, +1 – Shafizadeh Mar 29 '16 at 18:54
  • But how do you go from a token regex to an array of matches, without using a `List`? The `split` APIs use a delimiter expression, not a token expression. – erickson Mar 29 '16 at 18:55
  • @erickson: not sure what you mean..? – Scott Weaver Mar 29 '16 at 19:00
  • This is a wrong solution that will fail if there is an escaped ``\`` right before a `"`. **One can't parse such grammars with lookaheads like this**. – Wiktor Stribiżew Mar 29 '16 at 19:23
  • The OP says, "I need to create an array of strings out of the given s parameter" How do you get from the regex to the array? – erickson Mar 29 '16 at 19:40
  • There's no requirement to escape \ – erickson Mar 29 '16 at 20:01
  • Thank you for the answer. It worked great until a string with the `=` character came up. It skips the `=` for some reason. – bobasti Mar 29 '16 at 20:20
  • @erickson ``There's no requirement to escape \`` now it officially is, at least inside quotes (I asked OP about it). BTW if I remember correctly you were trying to create ``\`` (inside code sample). To do so you need to surround \ with two `\`` (from both sides) like `\`\`\\`\``. – Pshemo Mar 29 '16 at 20:34
  • Thank you for noticing. :D Noted. – bobasti Mar 29 '16 at 20:42

An alternative method that does not use a regex:

import java.util.ArrayList;
import java.util.Arrays;

public class SplitKeepingQuotationMarks {
    public static void main(String[] args) {
        String pattern = "This is \"a string\" and this is \"a \\\"nested\\\" string\"";
        System.out.println(Arrays.toString(splitKeepingQuotationMarks(pattern)));
    }

    public static String[] splitKeepingQuotationMarks(String s) {
        ArrayList<String> results = new ArrayList<>();
        StringBuilder last = new StringBuilder();
        boolean inString = false;
        boolean wasBackSlash = false; // true if the previous char was an unescaped backslash

        for (char c : s.toCharArray()) {
            if (Character.isSpaceChar(c) && !inString) {
                // A space outside quotes ends the current token
                if (last.length() > 0) {
                    results.add(last.toString());
                    last.setLength(0); // Clears the s.b.
                }
                wasBackSlash = false;
            } else {
                last.append(c);
                // Only an unescaped quote opens or closes a string
                if (c == '"' && !wasBackSlash)
                    inString = !inString;
                // Reset the flag after every char; an escaped backslash escapes nothing
                wasBackSlash = (c == '\\') && !wasBackSlash;
            }
        }

        if (last.length() > 0) // Avoid an empty trailing token
            results.add(last.toString());
        return results.toArray(new String[results.size()]);
    }
}

Output:

[This, is, "a string", and, this, is, "a \"nested\" string"]
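Since the question rules out the Java Collection Framework, the same character-by-character idea can be sketched with a plain array by scanning twice: once to count the tokens and once to store them. The class name ManualSplitter and the helper scan are made up here, and the sketch assumes Character.isWhitespace as the separator test and the escaping rules from the question's update.

```java
// Hypothetical collection-free variant; only the public method signature comes from the question
class ManualSplitter {
    public static String[] splitKeepingQuotationMarks(String s) {
        String[] result = new String[scan(s, null)]; // pass 1: count only
        scan(s, result);                             // pass 2: fill the array
        return result;
    }

    // Walks the input once; returns the token count and, if out is non-null, stores the tokens
    private static int scan(String s, String[] out) {
        int count = 0;
        int start = -1;           // start index of the current token, -1 if none
        boolean inString = false; // inside an unescaped pair of quotes
        boolean escaped = false;  // previous char was an unescaped backslash

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!inString && Character.isWhitespace(c)) {
                if (start >= 0) { // whitespace outside quotes ends the token
                    if (out != null) {
                        out[count] = s.substring(start, i);
                    }
                    count++;
                    start = -1;
                }
                escaped = false;
            } else {
                if (start < 0) {
                    start = i;
                }
                if (c == '"' && !escaped) {
                    inString = !inString;
                }
                escaped = (c == '\\') && !escaped;
            }
        }
        if (start >= 0) { // last token runs to the end of the input
            if (out != null) {
                out[count] = s.substring(start);
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] parts = splitKeepingQuotationMarks(
                "This is \"a string\" and this is \"a \\\"nested\\\" string\"");
        for (int i = 0; i < parts.length; i++) {
            System.out.println("[" + i + "] " + parts[i]);
        }
    }
}
```

Tokens are taken as substrings of the input, so no StringBuilder is needed; the trade-off is the second scan.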

Majora320
  • `import java.util.ArrayList;` -> "without using the Java Collection Framework or its derivatives." – Pshemo Mar 29 '16 at 19:09