0

I am currently writing a Java program that uses regular expressions, but I am struggling because I am pretty new to regex.

KEY_EXPRESSION = "[a-zA-z0-9]+";
VALUE_EXPRESSION = "[a-zA-Z0-9\\*\\+,%_\\-!@#\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]/ ]*";
CHUNK_EXPRESSION = "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)";

The target syntax is key(value)+key(value)+key(value). The key is alphanumeric, and the value may be any combination of characters.

This has been okay so far. However, I have a problem with '(' and ')' in the value. If I place '(' or ')' in the value, the value swallows all the rest of the input.

e.g. number(abc(kk)123)+status(open) returns key:number, value:abc(kk)123)+status(open
It is supposed to yield two key-value pairs.

Can you suggest how to improve the expression above?

Unihedron
Chris
    ... So what are you trying to do? Also, take a [tour]. – Unihedron Aug 18 '14 at 06:48
  • Someone posted an answer with a working solution regex: ([a-zA-z0-9]+)\((.*?)\)(?=\+|$) - This works great. When I tested on online regex tester site and came back, the post had gone. Is it right solution? I am wondering why the answer has been deleted. – Chris Aug 18 '14 at 07:56
  • Just want to confirm whether this is a working solution. I thought the poster deleted this as this may not be the right solution. Anyway, I will post this as an answer. Thanks! – Chris Aug 18 '14 at 08:02
  • Does the value contain `)+` somewhere? – Braj Aug 18 '14 at 08:06
  • It is actually user input, so it could contain any characters. However, I do not think a user would type that particular combination. This is a reasonable solution. Thanks! – Chris Aug 18 '14 at 08:11
  • For reference, see this existing question: https://stackoverflow.com/questions/25204979/regex-trouble-matching-the-pattern-cmd/25217934#25217934 – jawee Aug 18 '14 at 08:34
  • You _do not think_ the user will type that combination? You should never assume such things. I’m confused that that comment even got an upvote. – Michael Piefel Aug 18 '14 at 10:07

4 Answers

2

Not possible with regular expressions at all, sorry. If you want to count opening and closing parentheses, regular expressions are, in general, not good enough. The language you are trying to parse is not a regular language.

Of course, there may be ways around that limitation. We cannot know that if you give us as little context as you did.

Michael Piefel
1

Get the matched groups at index 1 and 2.

([a-zA-Z0-9]+)\((.*?)\)(?=\+|$)

Here is an online demo.

The above regex pattern looks for )+ as the delimiter between key-value pairs (or for ) at the end of the input).

Note: The above regex pattern will not work if a value contains )+, for example number(abc(kk)+123+4+4)+status(open)


Sample code:

String str = "number(abc(kk)123)+status(open)";
Pattern p = Pattern.compile("([a-zA-Z0-9]+)\\((.*?)\\)(?=\\+|$)");
Matcher m = p.matcher(str);
while (m.find()) {
    System.out.println(m.group(1) + ":" + m.group(2));
}

output:

number:abc(kk)123
status:open
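
The limitation mentioned in the note can be reproduced with a small sketch like the one below. It reuses the same pattern as the sample code; the class name `LookaheadLimitDemo` and the helper `pairs` are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadLimitDemo {
    // Same pattern as in the sample code above.
    private static final Pattern CHUNK =
            Pattern.compile("([a-zA-Z0-9]+)\\((.*?)\\)(?=\\+|$)");

    // Hypothetical helper: returns each match as "key:value".
    static List<String> pairs(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = CHUNK.matcher(input);
        while (m.find()) {
            out.add(m.group(1) + ":" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        // The value of "number" contains ")+", so the match stops at that point.
        System.out.println(pairs("number(abc(kk)+123+4+4)+status(open)"));
        // prints [number:abc(kk, status:open]
    }
}
```

The first value is cut off at the inner )+, so the rest of the intended value is lost.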
Braj
1

Someone posted an answer with a working solution regex: ([a-zA-z0-9]+)\((.*?)\)(?=\+|$) - This works great. When I tested on online regex tester site and came back, the post had gone. Is it right solution? I am wondering why the answer has been deleted.

See this golfed regex:

([^\W_]+)\((.*?)\)(?![^+])
  • You can use the shorthand character class [^\W_] instead of [a-zA-Z0-9].
  • You can use the negative lookahead assertion (?![^+]), which succeeds only if the next character is + or the input has ended, in place of the alternation in (?=\+|$).

However, this is not a practical solution, because a )+ inside a value will break it: number(abc(kk)+5+123+4+4)+status(open)

This is a case where Java, whose regex implementation does not support recursion, is at a disadvantage. As I mentioned in this thread, the practical approach would be to use a workaround (copy-paste regex), or to build your own finite state machine to parse it.
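
A minimal sketch of the hand-written parser idea, under the assumption that parentheses inside each value are balanced and that keys are unique. It keeps a depth counter while scanning, which is the part a regex without recursion cannot do. The class name `ChunkParser` and its methods are hypothetical, not something from the question.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ChunkParser {
    private static boolean isAsciiAlnum(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');
    }

    // Parses key(value)+key(value)+... by tracking parenthesis depth.
    // Assumption: parentheses inside each value are balanced and keys are unique.
    static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        int i = 0;
        int n = input.length();
        while (i < n) {
            // Read the alphanumeric key.
            int keyStart = i;
            while (i < n && isAsciiAlnum(input.charAt(i))) {
                i++;
            }
            if (i == keyStart || i >= n || input.charAt(i) != '(') {
                throw new IllegalArgumentException("Expected key( at index " + keyStart);
            }
            String key = input.substring(keyStart, i);
            i++; // skip the opening '('
            int valueStart = i;
            int depth = 1;
            // Scan until the parenthesis that closes this chunk.
            while (i < n && depth > 0) {
                char c = input.charAt(i);
                if (c == '(') {
                    depth++;
                } else if (c == ')') {
                    depth--;
                }
                i++;
            }
            if (depth != 0) {
                throw new IllegalArgumentException("Unbalanced parentheses in value of " + key);
            }
            // i now points just past the closing ')'.
            result.put(key, input.substring(valueStart, i - 1));
            // Chunks are separated by a single '+'; a trailing '+' is tolerated.
            if (i < n) {
                if (input.charAt(i) != '+') {
                    throw new IllegalArgumentException("Expected '+' at index " + i);
                }
                i++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("number(abc(kk)+5+123+4+4)+status(open)"));
        // prints {number=abc(kk)+5+123+4+4, status=open}
    }
}
```

Because the value ends at the matching parenthesis rather than at the first )+, the input that breaks the regex above is split into two pairs as intended. A value with unbalanced parentheses is rejected with an exception instead of being silently mis-split.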

Also, you have a typographical error in your original regex. [a-zA-z0-9]+ has a range "A-z". You meant to type "A-Z".
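
To see why the typo matters: the range A-z spans ASCII 0x41 to 0x7A, which also covers the characters [ \ ] ^ _ and ` that sit between Z and a. A short sketch (the class name `RangeTypoDemo` and the helper `matchesKey` are made up for illustration):

```java
import java.util.regex.Pattern;

public class RangeTypoDemo {
    // Hypothetical helper: true if the whole candidate matches the regex.
    static boolean matchesKey(String regex, String candidate) {
        return Pattern.matches(regex, candidate);
    }

    public static void main(String[] args) {
        // With the typo, '_' and '^' are accepted as key characters.
        System.out.println(matchesKey("[a-zA-z0-9]+", "my_key^")); // prints true
        // With the corrected range, the same input is rejected.
        System.out.println(matchesKey("[a-zA-Z0-9]+", "my_key^")); // prints false
    }
}
```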

Community
Unihedron
0

I'll assume that you are able to append a + to the end of your input, i.e. number(abc(kk)123)+status(open)+

If that is possible, this should work:

KEY_EXPRESSION = "[a-zA-z0-9]+";
VALUE_EXPRESSION = "[a-zA-Z0-9\\*\\+,%_\\-!@#\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]\\(\\)/ ]*?";
CHUNK_EXPRESSION = "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)\\+"; // \\+ matches a literal '+'; an unescaped + would repeat the ')'

The changes on line 2 are adding escaped ( and ) to the character class and replacing * with *?

The ? turns off greedy matching, so the quantifier tries the shortest match first (a reluctant quantifier).

On line 3, a + is added at the end of the pattern to help separate the key(value) fields.
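
A minimal sketch of this approach, assuming the trailing + is meant as a literal plus sign (written as \\+ in the Java string) and using the corrected A-Z range in the key. The class name `TrailingPlusDemo` and the helper `pairs` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TrailingPlusDemo {
    static final String KEY_EXPRESSION = "[a-zA-Z0-9]+";
    // Value class now includes ( and ), with a reluctant quantifier.
    static final String VALUE_EXPRESSION =
            "[a-zA-Z0-9\\*\\+,%_\\-!@#\\$\\^=<>\\.\\?';:\\|~`&\\{\\}\\[\\]\\(\\)/ ]*?";
    // Each chunk must end with ")" followed by a literal "+".
    static final String CHUNK_EXPRESSION =
            "(" + KEY_EXPRESSION + ")\\((" + VALUE_EXPRESSION + ")\\)\\+";

    // Hypothetical helper: appends the trailing '+' and returns "key:value" strings.
    static List<String> pairs(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(CHUNK_EXPRESSION).matcher(input + "+");
        while (m.find()) {
            out.add(m.group(1) + ":" + m.group(2));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(pairs("number(abc(kk)123)+status(open)"));
        // prints [number:abc(kk)123, status:open]
    }
}
```

Like the lookahead-based patterns in the other answers, this still splits too early if a value itself contains )+.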

Tensibai