I'm looking for a regex that produces the following results:

  • it needs to group words surrounded by single or double quotes
  • it needs to keep a single quote in the output when there is no matching single quote in the string
  • when text is not surrounded by single/double quotes, it needs to split on spaces

I currently have:

Pattern pattern = Pattern.compile("[^\\s\"']+|\"([^\"]*)\"|'([^']*)'");

... but it does not handle all of the following examples correctly. Can anyone help me with this one?

Examples:

  • foo bar
    • group1: foo
    • group2: bar
    • description: split on space
  • "foo bar"
    • group1: foo bar
    • description: surrounded by double quotes so group foo and bar, but don't print double quotes
  • 'foo bar'
    • group1: foo bar
    • description: same as above, but with single quotes
  • 'foo bar
    • group1: 'foo
    • group2: bar
    • description: split on space and keep single quote
  • "'foo bar"
    • group1: 'foo bar
    • description: surrounded by double quotes so group 'foo and bar and keep single quote
  • foo bar'
    • group1: foo
    • group2: bar'
  • foo bar"
    • group1: foo
    • group2: bar"
  • "foo bar" "stack overflow"
    • group1: foo bar
    • group2: stack overflow
  • "foo' bar" "stack overflow" how do you do
    • group1: foo' bar
    • group2: stack overflow
    • group3: how
    • group4: do
    • group5: you
    • group6: do
Jochen Hebbrecht
  • I posted one [here](http://stackoverflow.com/a/8036736/823393) which may be a good start. It does not handle single quotes and it splits on commas instead of spaces, but one benefit is that there is a narrative on how it actually works. – OldCurmudgeon Oct 05 '12 at 08:44
  • Thanks, but user Keppil gave me the correct solution :-) – Jochen Hebbrecht Oct 05 '12 at 09:23
  • Keppil's solution covers your test cases, but note that it will not handle cases such as "A string with ""quotes"" in it". If you do not need that, then it's good to know you have an answer. – OldCurmudgeon Oct 05 '12 at 11:16
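
The limitation raised in the last comment can be illustrated with a small variant of the pattern from Keppil's answer. This is only a sketch, assuming the convention that a doubled `""` inside a double-quoted run stands for one literal quote; that convention, the class name `DoubledQuoteTokenizer` and the method name `tokenize` are assumptions for illustration and are not part of the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubledQuoteTokenizer {
    // Assumed convention: inside double quotes, "" means one literal ".
    // Alternatives: double-quoted run | single-quoted run | run of non-whitespace.
    private static final Pattern TOKEN =
            Pattern.compile("\"((?:[^\"]|\"\")+)\"|'([^']+)'|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                // Collapse each doubled quote back to a single quote.
                tokens.add(m.group(1).replace("\"\"", "\""));
            } else if (m.group(2) != null) {
                tokens.add(m.group(2));
            } else {
                tokens.add(m.group());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"A string with \"\"quotes\"\" in it\""));
        // prints: [A string with "quotes" in it]
    }
}
```

One consequence of this convention is that two quoted items written without a space between them, such as "foo""bar", are read as a single token.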

2 Answers

I'm not sure if you can do this in one Matcher.matches() call, but you can do it with a loop.
This piece of code handles all the cases you mention above by calling Matcher.find() repeatedly:

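// Needs java.util.Arrays, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern.
// Alternatives are tried in order: double-quoted run, single-quoted run, run of non-whitespace.
Pattern pattern = Pattern.compile("\"([^\"]+)\"|'([^']+)'|\\S+");
List<String> testStrings = Arrays.asList("foo bar", "\"foo bar\"","'foo bar'", "'foo bar", "\"'foo bar\"", "foo bar'", "foo bar\"", "\"foo bar\" \"stack overflow\"", "\"foo' bar\" \"stack overflow\" how do you do");
for (String testString : testStrings) {
    int count = 1;
    Matcher matcher = pattern.matcher(testString);
    System.out.format("* %s%n", testString);
    while (matcher.find()) {
        // Group 1 holds double-quoted content, group 2 single-quoted content; otherwise use the whole match.
        String token = matcher.group(1) != null ? matcher.group(1)
                : matcher.group(2) != null ? matcher.group(2) : matcher.group();
        System.out.format("\t* group%d: %s%n", count++, token);
    }
}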

This prints:

* foo bar
    * group1: foo
    * group2: bar
* "foo bar"
    * group1: foo bar
* 'foo bar'
    * group1: foo bar
* 'foo bar
    * group1: 'foo
    * group2: bar
* "'foo bar"
    * group1: 'foo bar
* foo bar'
    * group1: foo
    * group2: bar'
* foo bar"
    * group1: foo
    * group2: bar"
* "foo bar" "stack overflow"
    * group1: foo bar
    * group2: stack overflow
* "foo' bar" "stack overflow" how do you do
    * group1: foo' bar
    * group2: stack overflow
    * group3: how
    * group4: do
    * group5: you
    * group6: do
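
If the tokens are needed as a list rather than printed, the same loop can be wrapped in a small helper. The following is a minimal sketch using the pattern from this answer; the class name `QuoteTokenizer` and the method name `tokenize` are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteTokenizer {
    // Same pattern as above, compiled once and reused.
    private static final Pattern TOKEN =
            Pattern.compile("\"([^\"]+)\"|'([^']+)'|\\S+");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(m.group(1));   // content between double quotes
            } else if (m.group(2) != null) {
                tokens.add(m.group(2));   // content between single quotes
            } else {
                tokens.add(m.group());    // unquoted run, stray quotes kept
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("\"foo' bar\" \"stack overflow\" how do you do"));
        // prints: [foo' bar, stack overflow, how, do, you, do]
    }
}
```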
Keppil
  • That's it! You're a genius! :-) Thank you for helping me with this one. – Jochen Hebbrecht Oct 05 '12 at 09:10
  • Tested it here with no problem: http://www.regexplanet.com/advanced/java/index.html (you have to remove the Java string escaping) – dan1111 Oct 05 '12 at 09:14
  • @SJuan76, I was responding to your original comment. Your new example arguably goes outside the scope of the requirements in the question. – dan1111 Oct 05 '12 at 09:20
  • @dan1111 you are right... a better example is foo'"bar slashdot" – SJuan76 Oct 05 '12 at 09:21
  • @dan1111 I agree I should not have edited the comment. I saw that my original example (for the record, `'foo "bar hello"`) was not a good one before your reply, and I wanted to shorten the discussion – SJuan76 Oct 05 '12 at 09:24
  • @dan1111 I disagree that my second example is out of scope. You can't just say "the examples that do not work are out of scope", because grammars exist precisely to extend the scope of regex to do what is needed. – SJuan76 Oct 05 '12 at 09:26
  • @SJuan76, as I understand it, the question assumes that each item, whether quoted or not, will be separated by spaces. Given that assumption, I believe this is a robust regex. I understand your general point about the dangers of using regexes to parse, but I don't see the problem in a fairly simple case like this one. – dan1111 Oct 05 '12 at 09:28
  • @SJuan76 your second example will not occur – Jochen Hebbrecht Oct 05 '12 at 09:52

Any time you have pairings (be it quotes or braces), you leave the realm of regex and enter the realm of grammars, which need parsers.

I'll leave you with the ultimate answer to this question

UPDATE:

A little more explanation.

A grammar is usually expressed as:

construct -> [set of constructs or terminals]

For example, for quotes

doublequotedstring := " simplequotedstring "
simplequotedstring := string ' string
                      | string '
                      | ' string
                      | '

This is a simple example; proper examples of grammars for quoting can be found on the internet.

I have used aflex and ayacc for this (for Ada; for Java there are jflex and jjacc). You pass the list of identifiers to aflex and generate an output, then pass that output and the grammar to ayacc, and you get an Ada parser. It has been a long time since I used them, so I do not know whether more streamlined solutions exist, but in essence they will need the same input.
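
For input as small as the question's, a hand-written scanner is a lighter way to follow the non-regex route than a generated parser. The sketch below is not the generator workflow described above and is not derived from the grammar shown; it only walks the string character by character and applies the rules from the question. The class name `QuoteScanner` and the method name `scan` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class QuoteScanner {
    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            // Note: Character.isWhitespace is close to, but not identical to, regex \s.
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            if (c == '"' || c == '\'') {
                int close = input.indexOf(c, i + 1);
                // A matching quote with non-empty content: emit the content without the quotes.
                if (close > i + 1) {
                    tokens.add(input.substring(i + 1, close));
                    i = close + 1;
                    continue;
                }
            }
            // Otherwise read up to the next whitespace, keeping any stray quote.
            int end = i;
            while (end < input.length() && !Character.isWhitespace(input.charAt(end))) {
                end++;
            }
            tokens.add(input.substring(i, end));
            i = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("'foo bar"));
        // prints: ['foo, bar]
    }
}
```

Because the rules live in ordinary control flow, extending them (for example with escape characters) means adding a branch rather than reworking a pattern.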

Community
SJuan76
  • I'm not trying to parse an HTML string in RegEx. I'm trying to group words and split them on single/double quotes and spaces. I guess those things are the reason we use ... regular expressions, no? There's no alternative in my opinion. – Jochen Hebbrecht Oct 05 '12 at 08:25
  • Read my comment. Matching quotes means that you are working with "proper" grammars (all regex are grammars, but not all grammars are regex; your example is not regex). Now read the linked answer. **There is no way to parse "proper" grammars with regex**. You may use regex for a subset of a grammar which happens to be a regex (detecting if a String begins and ends with quote, for example) but nothing else. It is a mathematical impossibility. – SJuan76 Oct 05 '12 at 08:31
  • And by the way, the answer I linked has that style and so many upvotes because, when you tell someone that he needs a grammar, the usual answer is "but I want to do it with a regex"... – SJuan76 Oct 05 '12 at 08:32
  • I have downvoted this answer, because it is unhelpful. It is perfectly reasonable to use a RegEx for this. There is a world of difference between this problem and parsing HTML. – dan1111 Oct 05 '12 at 08:33
  • @dan1111, be my guest and show me the regex. I will enjoy upvoting it, **if it works** – SJuan76 Oct 05 '12 at 08:34
  • @SJuan76: ok, but can you help me by providing a solution to my problem? I don't mind if it's not done with a regex, but can you give me an alternative? – Jochen Hebbrecht Oct 05 '12 at 08:41
  • @SJuan76: another user has provided me the solution. I'm going to downvote your answer as it is not useful for me – Jochen Hebbrecht Oct 05 '12 at 09:10
  • @JochenHebbrecht Don't downvote just because you cannot use it. The answer is in fact very useful! – Baz Oct 05 '12 at 09:11
  • @SJuan76: ... you are lucky, I don't have enough points to downvote :-D – Jochen Hebbrecht Oct 05 '12 at 09:11