
I have a string like the one below:

"[name:pizza,quantity:12,unit_price:3.43],[name:burger,unit_price:40.47,quantity:24],[quantity:4,unit_price:14.47,name:hotdog]";

Here name, quantity and unit_price do not appear in the same order in every group. I want to apply a regex to this string and extract data like this:

  1. Group one: name- pizza, quantity- 12, unit_price- 3.43
  2. Group two: name- burger, quantity- 24, unit_price- 40.47
  3. ...

I have tried this pattern so far:

    (\{(?:name:(?<name>[a-zA-Z\s]+))|(?:amount:(?<amount>[0-9]+))|(?:unit_price:(?<unitPrice>[0-9]+.?[0-9]*))\})
    

But I don't know how to extract every nested group, and I suspect this pattern is not right for the string above.

How do I do this in Java with pure regex, without splitting and iterating?

Dean Taylor

1 Answer


The following assumes:

  • There are no additional characters before the initial [ or after the last ].
  • The values of name, quantity etc. contain neither , nor ].

Regular Expression

\G\[(?:(?:name:(?<name>[^,\]]+)|quantity:(?<quantity>[^,\]]+)|unit_price:(?<unit_price>[^,\]]+)),?)*\](?:,|\z)

https://regex101.com/r/gW2cL1/1

Visualisation

RegEx Visualisation

Code

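import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class ExtractGroups {
    public static void main(String[] args) {
        String subjectString = "[name:pizza,quantity:12,unit_price:3.43],[name:burger,unit_price:40.47,quantity:24],[quantity:4,unit_price:14.47,name:hotdog]";
        try {
            // Java group names may contain only ASCII letters and digits, so the unit_price group is named unitPrice here.
            Pattern regex = Pattern.compile("\\G\\[(?:(?:name:(?<name>[^,\\]]+)|quantity:(?<quantity>[^,\\]]+)|unit_price:(?<unitPrice>[^,\\]]+)),?)*\\](?:,|\\z)", Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE | Pattern.DOTALL);
            Matcher regexMatcher = regex.matcher(subjectString);
            while (regexMatcher.find()) { // each find() consumes one [...] group
                System.out.println("name- " + regexMatcher.group("name")
                        + ", quantity- " + regexMatcher.group("quantity")
                        + ", unit_price- " + regexMatcher.group("unitPrice"));
            }
        } catch (PatternSyntaxException ex) {
            // Syntax error in the regular expression
            System.err.println(ex.getDescription());
        }
    }
}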

Human Readable

// \G\[(?:(?:name:(?<name>[^,\]]+)|quantity:(?<quantity>[^,\]]+)|unit_price:(?<unit_price>[^,\]]+)),?)*\](?:,|\z)
// 
// Options: Case insensitive; Exact spacing; Dot matches line breaks; ^$ don’t match at line breaks; Default line breaks
// 
// Assert position at the end of the previous match (the start of the string for the first attempt) «\G»
// Match the character “[” literally «\[»
// Match the regular expression below «(?:(?:name:(?<name>[^,\]]+)|quantity:(?<quantity>[^,\]]+)|unit_price:(?<unit_price>[^,\]]+)),?)*»
//    Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Match the regular expression below «(?:name:(?<name>[^,\]]+)|quantity:(?<quantity>[^,\]]+)|unit_price:(?<unit_price>[^,\]]+))»
//       Match this alternative (attempting the next alternative only if this one fails) «name:(?<name>[^,\]]+)»
//          Match the character string “name:” literally (case insensitive) «name:»
//          Match the regex below and capture its match into a backreference named “name” (also backreference number 1) «(?<name>[^,\]]+)»
//             Match any single character NOT present in the list below «[^,\]]+»
//                Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//                The literal character “,” «,»
//                The literal character “]” «\]»
//       Or match this alternative (attempting the next alternative only if this one fails) «quantity:(?<quantity>[^,\]]+)»
//          Match the character string “quantity:” literally (case insensitive) «quantity:»
//          Match the regex below and capture its match into a backreference named “quantity” (also backreference number 2) «(?<quantity>[^,\]]+)»
//             Match any single character NOT present in the list below «[^,\]]+»
//                Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//                The literal character “,” «,»
//                The literal character “]” «\]»
//       Or match this alternative (the entire group fails if this one fails to match) «unit_price:(?<unit_price>[^,\]]+)»
//          Match the character string “unit_price:” literally (case insensitive) «unit_price:»
//          Match the regex below and capture its match into a backreference named “unit_price” (also backreference number 3) «(?<unit_price>[^,\]]+)»
//             Match any single character NOT present in the list below «[^,\]]+»
//                Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
//                The literal character “,” «,»
//                The literal character “]” «\]»
//    Match the character “,” literally «,?»
//       Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
// Match the character “]” literally «\]»
// Match the regular expression below «(?:,|\z)»
//    Match this alternative (attempting the next alternative only if this one fails) «,»
//       Match the character “,” literally «,»
//    Or match this alternative (the entire group fails if this one fails to match) «\z»
//       Assert position at the very end of the string «\z»
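
The outline above depends on one behaviour that is easier to see in isolation: each pass through the `*` loop matches a single key, and, as far as I understand `java.util.regex`, a named group keeps the value it captured in an earlier pass when a later pass takes a different alternative. That is what makes the key order irrelevant. A minimal sketch, using a made-up two-key pattern and a hypothetical class name `RetainedCaptures` (neither is from the question):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RetainedCaptures {
    // Same shape as the answer's pattern, reduced to two illustrative keys "a" and "b".
    private static final Pattern GROUP = Pattern.compile(
            "\\[(?:(?:a:(?<a>[^,\\]]+)|b:(?<b>[^,\\]]+)),?)*\\]");

    // Hypothetical helper: returns "a=<value>;b=<value>", or null if the input does not match.
    static String parse(String input) {
        Matcher m = GROUP.matcher(input);
        if (!m.matches()) {
            return null;
        }
        // group(...) returns null for a key that never appeared in the input.
        return "a=" + m.group("a") + ";b=" + m.group("b");
    }

    public static void main(String[] args) {
        System.out.println(parse("[a:1,b:2]"));
        System.out.println(parse("[b:2,a:1]"));
        System.out.println(parse("[b:2]"));
    }
}
```

A key that is missing from a group leaves its capture as `null`, so real code should check for that before using the value.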

Additional Notes

You didn't specify any rules for what kind of data name or any of the other values may contain.

So "one or more characters that are neither , nor ]" was the obvious choice.

To be more specific, adjust the regular expression to capture only the data you want in place of the [^,\]]+ elements.

So for name you might use [a-z]{3,10} to match between 3 and 10 letters a through z, with no spaces. The name value won't be captured if it doesn't match.
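
As a rough sketch of that idea, the version below swaps in a stricter pattern for each value and converts the captures to typed values. The three value patterns, the class name `StrictValues` and the `parse` helper are illustrative assumptions, not rules taken from the question. The third group is named `unitPrice` because Java group names may contain only ASCII letters and digits.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StrictValues {
    // Assumed value rules: short alphabetic names, integer quantities, decimal prices.
    private static final String NAME = "[a-z]{3,10}";
    private static final String QUANTITY = "[0-9]+";
    private static final String PRICE = "[0-9]+(?:\\.[0-9]+)?";

    private static final Pattern GROUP = Pattern.compile(
            "\\G\\[(?:(?:name:(?<name>" + NAME + ")"
            + "|quantity:(?<quantity>" + QUANTITY + ")"
            + "|unit_price:(?<unitPrice>" + PRICE + ")),?)*\\](?:,|\\z)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: one "name xQuantity = total" line per complete group.
    static List<String> parse(String input) {
        List<String> rows = new ArrayList<>();
        Matcher m = GROUP.matcher(input);
        while (m.find()) {
            String name = m.group("name");
            String quantityText = m.group("quantity");
            String priceText = m.group("unitPrice");
            if (name == null || quantityText == null || priceText == null) {
                continue; // a key was missing from this group, skip it
            }
            int quantity = Integer.parseInt(quantityText);
            BigDecimal total = new BigDecimal(priceText).multiply(BigDecimal.valueOf(quantity));
            rows.add(name + " x" + quantity + " = " + total);
        }
        return rows;
    }

    public static void main(String[] args) {
        String input = "[name:pizza,quantity:12,unit_price:3.43],"
                + "[name:burger,unit_price:40.47,quantity:24],"
                + "[quantity:4,unit_price:14.47,name:hotdog]";
        parse(input).forEach(System.out::println);
    }
}
```

Because of the `\G` anchor, matching stops at the first group whose values do not fit these patterns, so later groups are not reported either.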

As already mentioned in the comments, changing ,? to (?:,|(?=\])) will ensure each item is followed by either a , or the closing ].
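
The difference between the two separators only shows up once a value pattern is stricter than [^,\]]+, because that class already runs up to the next , or ]. The sketch below therefore assumes the [a-z]{3,10} name pattern from the previous note and compares both separators on the malformed input from the comments. The class name `SeparatorCheck` and its helpers are hypothetical, and the third group is named `unitPrice` since Java group names allow only ASCII letters and digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeparatorCheck {
    // Builds the pattern with a caller-supplied item separator (illustrative helper).
    static Pattern build(String separator) {
        return Pattern.compile(
                "\\G\\[(?:(?:name:(?<name>[a-z]{3,10})"
                + "|quantity:(?<quantity>[^,\\]]+)"
                + "|unit_price:(?<unitPrice>[^,\\]]+))" + separator + ")*\\](?:,|\\z)",
                Pattern.CASE_INSENSITIVE);
    }

    static final Pattern LOOSE = build(",?");              // separator is optional
    static final Pattern STRICT = build("(?:,|(?=\\]))");  // a comma, or the closing ] must follow

    // Returns the captured name of the first group, or null when nothing matches.
    static String firstName(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        return m.find() ? m.group("name") : null;
    }

    public static void main(String[] args) {
        String garbage = "[name:pizzaquantity:12,unit_price:3.43]";
        System.out.println("loose:  " + firstName(LOOSE, garbage));
        System.out.println("strict: " + firstName(STRICT, garbage));
    }
}
```

With the optional comma, the name pattern can backtrack to `pizza` and let `quantity:12` follow it directly, so the malformed group is accepted. With the lookahead, nothing matches, which is the "garbage in, nothing out" behaviour asked for in the comments.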

Dean Taylor
  • The optional `,` would allow garbage input such as `[name:pizzaquantity:12,unit_price:3.43]`. The standard solution is `token (separator token)*`. If you don't like repeating the portion of the regex, just build the regex with string concat. By the way, a syntax explanation of a regex is almost always useless, unless the pattern is extremely simple. – nhahtdh Dec 21 '15 at 11:39
  • Because of the `\G` anchor point it simply captures `pizzaquantity:12` as the `name` value - which it could very well be - a non-issue. Garbage in, garbage out. If it was a worry I would switch it for a simple lookahead, changing `,?` for `(?:,|(?=\]))`. Personally I always find a syntax outline useful - to each their own. – Dean Taylor Dec 21 '15 at 11:50
  • nice graph (and nice site): how do you get it ? – guillaume girod-vitouchkina Dec 21 '15 at 12:00
  • @DeanTaylor: My aim is to make the regex do garbage in, nothing out instead. If it's possible to exclude the error, then it's better to do it here instead of post-process the result or use garbage data. – nhahtdh Dec 22 '15 at 02:52
  • Added some additional notes. – Dean Taylor Dec 22 '15 at 07:50