5

I'm trying to capture key-value pairs from strings that have the following form:

a0=d235 a1=2314 com1="abcd" com2="a b c d"

Using help from this post, I was able to write the following regex that captures the key-value pairs:

Pattern.compile("(\\w*)=(\"[^\"]*\"|[^\\s]*)");

The problem is that the second group in this pattern also captures the quotation marks, as follows:

a0=d235
a1=2314
com1="abcd"
com2="a b c d"

How do I exclude the quotation marks? I want something like this:

a0=d235
a1=2314
com1=abcd
com2=a b c d

EDIT:

It is possible to achieve the above by capturing the value in different groups depending on whether there are quotation marks or not. I'm writing this code for a parser, so for performance reasons I'm trying to come up with a regex that returns the value in the same group number.

Community
Dawood

2 Answers

10

How about this? The idea is to split the last group into 2 groups.

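// The alternation is wrapped in (?: ... ) so that "key=" applies to both the
// quoted (group 2) and the unquoted (group 3) alternative.
Pattern p = Pattern.compile("(\\w+)=(?:\"([^\"]+)\"|([^\\s]+))");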

String test = "a0=d235 a1=2314 com1=\"abcd\" com2=\"a b c d\"";
Matcher m = p.matcher(test);

while(m.find()){
    System.out.print(m.group(1));
    System.out.print("=");
    System.out.print(m.group(2) == null ? m.group(3):m.group(2));
    System.out.println();
}
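
For reference, the two-group idea above can be written as a self-contained program. This is only a sketch: the class name `TwoGroupParser` and the `parse` helper are made up for illustration, and the alternation is wrapped in a non-capturing group so that the key always lands in group 1.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class and method names, for illustration only.
class TwoGroupParser {
    // Group 1: key. Group 2: quoted value (quotes excluded). Group 3: unquoted value.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=(?:\"([^\"]+)\"|([^\\s]+))");

    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // Exactly one of groups 2 and 3 participates in each match.
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            pairs.put(m.group(1), value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        String test = "a0=d235 a1=2314 com1=\"abcd\" com2=\"a b c d\"";
        parse(test).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The null check on group 2 is the per-match cost the question's edit is trying to avoid.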

Update

Here is a new solution in response to the updated question. This regex uses positive look-ahead and look-behind to make sure there is a quote on each side of the value without capturing it. This way, groups 2 and 3 above can be combined into a single group (group 2 below). There is no way to exclude the quotes while returning group 0.

Pattern p = Pattern.compile("(\\w+)=\"*((?<=\")[^\"]+(?=\")|([^\\s]+))\"*");

String test = "a0=d235 a1=2314 com1=\"abcd\" com2=\"a b c d\"";
Matcher m = p.matcher(test);

while(m.find()){
    System.out.print(m.group(1));
    System.out.print("=");
    System.out.println(m.group(2));
}

Output

a0=d235
a1=2314
com1=abcd
com2=a b c d
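
A self-contained sketch of the same-group approach, collecting the pairs into a map. The class name `SameGroupParser` and the `parse` helper are hypothetical. The inner alternative is written without its own capturing parentheses, which is an adjustment to the pattern above so that only groups 1 and 2 are kept.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class and method names, for illustration only.
class SameGroupParser {
    // Group 1: key. Group 2: value, whether or not it was quoted in the input.
    // \"* consumes an optional quote outside group 2; the look-behind and
    // look-ahead only assert that a quote sits on each side of a quoted value.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=\"*((?<=\")[^\"]+(?=\")|[^\\s]+)\"*");

    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // No null check needed: the value is always in group 2.
            pairs.put(m.group(1), m.group(2));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String test = "a0=d235 a1=2314 com1=\"abcd\" com2=\"a b c d\"";
        parse(test).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```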
user845279
  • This is similar to @burning_LEGION's answer. I've just made an edit to my question; is it possible to capture them in the same group? – Dawood Jul 13 '12 at 21:38
  • No, not all in one expression. You would have to get rid of the quotation marks in every one of the right-side groups. See here: http://stackoverflow.com/questions/277547/regular-expression-to-skip-character-in-capture-group – VolatileRig Jul 13 '12 at 22:56
  • @Dawood It is possible to capture quoted and unquoted strings in a single group while excluding the quotes but there is no way to capture everything (group 0) while excluding quotes. – user845279 Jul 13 '12 at 23:27
  • @user845279: this works... thanks! The lookahead and lookbehind constructs are pretty useful but I haven't quite gotten the hang of them yet. – Dawood Jul 13 '12 at 23:52
  • Wow, I didn't think it was possible, but your new update works really well! However, you do want to add a non-capturing clause, because right now you're keeping 3 groups. Here's an update on yours: `Pattern.compile("(\\w+)=\"*((?<=\")[^\"]+(?=\")|(?:[^\\s]+))\"*");` – VolatileRig Jul 14 '12 at 17:03
0

Use this regex: `(\w+)=(("(.+?)")|(.+?)(?=\s|$))`. The key is captured in group 1, and the value in group 4 (quoted) or group 5 (unquoted).
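
A rough sketch of how this regex could be used from Java, assuming the usual string escaping. The class name `NestedGroupParser` and the `parse` helper are invented for the example. With this pattern the quoted value ends up in group 4 and the unquoted one in group 5, so the caller still has to check which one matched.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class and method names, for illustration only.
class NestedGroupParser {
    // Group 1: key. Group 4: quoted value (quotes excluded). Group 5: unquoted value.
    private static final Pattern PAIR =
            Pattern.compile("(\\w+)=((\"(.+?)\")|(.+?)(?=\\s|$))");

    static Map<String, String> parse(String input) {
        Map<String, String> pairs = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(input);
        while (m.find()) {
            // Group 4 is null when the value was not quoted.
            String value = m.group(4) != null ? m.group(4) : m.group(5);
            pairs.put(m.group(1), value);
        }
        return pairs;
    }

    public static void main(String[] args) {
        String test = "a0=d235 a1=2314 com1=\"abcd\" com2=\"a b c d\"";
        parse(test).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```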

burning_LEGION
  • I tried something similar but since I'm writing this code for a parser, I'm trying to avoid checking groups separately since it will affect performance. Your code will store the value in different groups depending on whether there were quotation marks or not. Is there a way to store it in the same group? – Dawood Jul 13 '12 at 21:33
  • Could you explain what's the meaning of `( .+?)`? – NeoZoom.lua Apr 14 '19 at 11:42
  • @LiSeeLeiCow-Q__Q It matches all characters up to the next `"`, the same as `([^"]+)"`. – burning_LEGION Apr 14 '19 at 12:02