I have a String that I need to split on spaces, while keeping text between matching quotes together.

For example, given

string = "It is fun \"to write\" regular\"expression"

after the split I want the result to be:

It

is

fun

"to write"

regular

"expression

The closest I have come is this regular expression:

STRING_SPLIT_REGEXP = "[^\\s\"']+|\"([^\"]*)\"|'([^']*)'"

Thanks in advance for answers.

Pshemo
  • Do you have to use the `split` method? It is quite easy to write your own parser that splits the string in a single pass. – Pshemo Mar 14 '14 at 21:53
  • Also, shouldn't there be a space in `regular\"expression` before the `\"`? – Pshemo Mar 14 '14 at 21:59
  • I doubt this is a regular language, so regular expressions won't work. People who want to use regular expressions instead of parsers for everything make me sad. – David Conrad Mar 14 '14 at 22:01

3 Answers

2

It seems you took the regex from this answer, but as you can see there, it is used with the find method of the Matcher class, not with split. That answer also handles ', which does not appear in your input.

So you can simplify this regex by removing the parts that handle ', which leaves

[^\\s\"]+|\"([^\"]*)\"

Also, since you want the " characters to be part of the token, you don't need a separate group for the text between the quotes, so drop the parentheses from the \"([^\"]*)\" part:

[^\\s\"]+|\"[^\"]*\"

Now all you need is to handle the case where there is no closing " and the end of the string comes instead. So change the regex to

[^\\s\"]+|\"[^\"]*(\"|$)

After this you can use a Matcher to find all the tokens and store them somewhere, say in a List.

Example:

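import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

String data = "It is fun \"to write\" regular\"expression";
List<String> matchList = new ArrayList<String>();
// an unquoted word, or a quote up to its closing quote or the end of the string
Pattern regex = Pattern.compile("[^\\s\"]+|\"[^\"]*(\"|$)");
Matcher regexMatcher = regex.matcher(data);
while (regexMatcher.find()) {
    System.out.println(regexMatcher.group());
    matchList.add(regexMatcher.group());
}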

Output:

It
is
fun
"to write"
regular
"expression
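
The same find-based approach can also be packaged as a reusable method. This is only a sketch: it assumes Java 9 or later (for Matcher.results()), and the names QuoteTokenizer and tokenize are made up for illustration.

```java
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class QuoteTokenizer {
    // Either a run of characters that are neither whitespace nor quotes,
    // or a quote followed by non-quote characters up to the closing quote
    // or the end of the input.
    private static final Pattern TOKEN = Pattern.compile("[^\\s\"]+|\"[^\"]*(\"|$)");

    // Hypothetical helper: collects every match of TOKEN, in order.
    static List<String> tokenize(String data) {
        return TOKEN.matcher(data)
                .results()                 // Java 9+: stream of successive matches
                .map(MatchResult::group)   // the matched text of each token
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "It is fun \"to write\" regular\"expression";
        tokenize(data).forEach(System.out::println);
    }
}
```

Compiling the pattern once in a static field avoids recompiling it on every call.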

A more complex expression that handles this data with split could look like

String data = "It is fun \"to write\" regular \"expression";
for(String s : data.split("(?<!\\G)(?<=\\G[^\"]*(\"[^\"]{0,100000}\")?[^\"]*)((?<=\"(?!\\s))|\\s+|(?=\"))"))
    System.out.println(s);

but this approach is far more complicated than writing your own parser.


Such a parser could look like this:

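public static List<String> parse(String data) {
    List<String> tokens = new ArrayList<String>();
    StringBuilder sb = new StringBuilder(); // characters of the token being built
    boolean insideQuote = false;            // true between an opening " and its closing "
    char previous = '\0';

    for (char ch : data.toCharArray()) {
        if (ch == ' ' && !insideQuote) {
            // a space outside quotes ends the current token
            if (sb.length() > 0 && previous != '"')
                addTokenAndResetBuilder(sb, tokens);
        } else if (ch == '"') {
            if (insideQuote) {
                // closing quote: it belongs to the quoted token
                sb.append(ch);
                addTokenAndResetBuilder(sb, tokens);
            } else {
                // opening quote: flush the text before it, then start a new token
                addTokenAndResetBuilder(sb, tokens);
                sb.append(ch);
            }
            insideQuote = !insideQuote;
        } else {
            sb.append(ch);
        }
        previous = ch;
    }
    // flush whatever is left, including an unclosed quoted token
    addTokenAndResetBuilder(sb, tokens);

    return tokens;
}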

private static void addTokenAndResetBuilder(StringBuilder sb, List<String> list) {
    if (sb.length() > 0) {
        list.add(sb.toString());
        sb.delete(0, sb.length());
    }
}

Usage:

String data = "It is fun \"to write\" regular\"expression\"xxx\"yyy";
for (String s : parse(data))
    System.out.println(s);

Output:

It
is
fun
"to write"
regular
"expression"
xxx
"yyy
Pshemo
  • +1 gotta give you an upvote, since it works, but come on... let's try to find a split regex. It's a good exercise. – Bohemian Mar 14 '14 at 23:31
1

You are running into a fundamental limitation of regular expressions here. In general they can't detect recursion, depth, etc.

So in your string:

"It is fun \"to write\" regular\"expression"

Both the space between to and write and the space between \" and regular have a quote mark on either side, yet only the first is inside a quoted section. Regex is not able to "count" the number of quotes in a flexible way and take action based on it.

You will need to write your own string parser for this (or use an existing one); regex can't handle it.
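
To make the "counting" concrete: whether a space lies inside quotes depends on how many quote marks come before it, and a hand-written scan can track that with a single boolean flag. The following is a minimal sketch; the names QuoteParity and spacesInsideQuotes are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteParity {
    // Returns the indexes of the spaces preceded by an odd number of quotes,
    // i.e. the spaces that lie inside an open quoted section.
    static List<Integer> spacesInsideQuotes(String s) {
        List<Integer> inside = new ArrayList<>();
        boolean insideQuote = false;
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (ch == '"') {
                insideQuote = !insideQuote; // every quote flips the state
            } else if (ch == ' ' && insideQuote) {
                inside.add(i);
            }
        }
        return inside;
    }

    public static void main(String[] args) {
        String s = "It is fun \"to write\" regular\"expression";
        // Only the space between "to" and "write" (index 13) is reported.
        System.out.println(spacesInsideQuotes(s));
    }
}
```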

Bohemian
Tim B
1

The trick is to use a flexible look ahead to assert that:

  • if there's an even number of quotes in the input, there should be an even number following the space, because an odd number means the space is within quotes
  • if there's an odd number of quotes in the input, there should be an odd number following the space, because an even number means the space is within quotes

I got it into one line, but it's a whopper:

String[] parts = str.split("(\\s+|(?<!\\s)(?=\"))(?=(([^\"]*\"){2})*[^\"]*"
            + (str.matches("(([^\"]*\"){2})*[^\"]*") ? "" : "\"[^\"]*") + "$)");

This correctly splits the example string with or without the trailing quote (whether or not the trailing term includes a space).
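
The even-quote half of that lookahead is easier to follow in isolation. The sketch below is a simplified variant, not the expression above: it assumes every quote in the input is closed and that quoted sections are separated from other text by whitespace, so it only splits on whitespace that is followed by an even number of quotes. The names BalancedQuoteSplit and splitOutsideQuotes are hypothetical.

```java
import java.util.Arrays;

class BalancedQuoteSplit {
    // Split on whitespace only when the rest of the input contains an even
    // number of quotes, which means the whitespace is outside any quoted section.
    // Assumes all quotes in the input are balanced.
    static String[] splitOutsideQuotes(String s) {
        return s.split("\\s+(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");
    }

    public static void main(String[] args) {
        String s = "It is fun \"to write\" regular";
        System.out.println(Arrays.toString(splitOutsideQuotes(s)));
    }
}
```

With an unclosed trailing quote the parity flips, which is why the full expression above chooses the tail of the lookahead based on the input.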

Bohemian
  • You are missing `)` at the end. Also, it seems that this regex splits on every space inside quotes, and doesn't handle the case where the split should be made on a `"` that has no space before or after it. – Pshemo Mar 14 '14 at 23:09
  • Problem with OP example is that last quote doesn't have to be closed. – Pshemo Mar 14 '14 at 23:11
  • The easiest way would be to use look-behind to check the number of `"`, but unfortunately (again) Java requires a look-behind expression to have a maximum length. – Pshemo Mar 14 '14 at 23:22
  • @Pshemo yeah - now that I have a java IDE to check, this one is proving tricky. I'm not giving up yet! I'll let you know if I crack it. – Bohemian Mar 14 '14 at 23:29
  • @Pshemo I think I cracked it, but it's little ugly. I had to partially build the regex to suit the input. See what you think. – Bohemian Mar 15 '14 at 04:56
  • It works, so +1. It also doesn't need tricks like `{0,x}` to cap the length in look-behind, so this split is safer than mine. The only advantage of my solution is that it doesn't traverse back to the beginning of the string every time, just to the end of the last match. Anyway, I believe the best solution here is writing your own parser, which can split this data in one iteration (like the one in my answer). – Pshemo Mar 15 '14 at 15:43