
I need to transform some data so that it is compatible with a data load. The nested key:value pairs need to be flattened, with each group's ID prepended to every one of its child pairs.

I've been trying to understand the page at Repeating a Capturing Group vs. Capturing a Repeated Group but can't seem to wrap my head around it.

My expression so far:

"(?'group'[\w]+)": {\n((\s*"(?'key'[^"]+)": "(?'value'[^"]+)"(?:,\n)?)+)\n},?

Working sample: https://regex101.com/r/Wobej7/1

I'm aware that using 1 or more intermediate steps would simplify the process but at this point I want to know if it's even possible.

Source Data Example:

"g1": {
  "k1": "v1",
  "k2": "v2",
  "k3": "v3"
},
"g2": {
  "k4": "v4",
  "k5": "v5",
  "k6": "v6"
},
"g3": {
  "k7": "v7",
  "k8": "v8",
  "k9": "v9"
}

Desired transformation:

{"g1","k1","v1"},
{"g1","k2","v2"},
{"g1","k3","v3"},
{"g2","k4","v4"},
{"g2","k5","v5"},
{"g2","k6","v6"},
{"g3","k7","v7"},
{"g3","k8","v8"},
{"g3","k9","v9"}
rumpled
  • Where are you using the regex? If in Notepad++, you might use `^("(\w+)":\h*{\h*)(?:\R\h+"(\w+)":\h*"(\w+)",?|\s*\}(?:,\R)?)` and replace with `(?{3}\{"$2","$3","$4"\},\n$1:)`, but you will have to click *Replace all* several times. – Wiktor Stribiżew Mar 10 '18 at 18:57
  • I've been using it in Sublime Text. I tested your solution in N++ and while it solves for the end solution, it doesn't capture more than one child at a time. The reason I posted on Stack Overflow is really to see if someone can help me understand repeating nested capture groups but thank you! – rumpled Mar 10 '18 at 19:17
  • As far as I'm aware, it's not possible in one single step. You have to use at least two regular expressions, which means one more mouse click. – revo Mar 10 '18 at 19:18
  • I'm not sure I see where it could be done in even 2 steps. One thing to clarify is that the groups in the real application do not have an even number of data, it's all different from 1-15 k:v pairs. – rumpled Mar 10 '18 at 19:27
  • @Rumpled In SublimeText, you still might get it to work, perhaps, with 2 steps. However, you should specify the format precisely. What is the real format of the input string? Regarding repeated capturing groups, you cannot work with them in text editors, and you can only work with them in a few programming languages. – Wiktor Stribiżew Mar 10 '18 at 21:01
  • [Here an idea for .NET regex in one step](http://www.regexstorm.net/tester?p=%28%3f%3a%5e%22%5cw%2b%22%3a+%7b%5cr%3f%5cn%29%3f%28%3f%3c%3d%28%22%5cw%2b%22%29%3a+%7b%5b%5e%7b%5d*%29++%28%22%5cw%2b%22%29%3a+%28%22%5cw%2b%22%29%2c%3f%28%3f%3a%5cr%3f%5cn%7d%2c%3f%29%3f&i=%22g1%22%3a+%7b%0d%0a++%22k1%22%3a+%22v1%22%2c%0d%0a++%22k2%22%3a+%22v2%22%2c%0d%0a++%22k3%22%3a+%22v3%22%0d%0a%7d%0d%0a%22g2%22%3a+%7b%0d%0a++%22k4%22%3a+%22v4%22%2c%0d%0a%7d%2c&r=%7b%241%2c%242%2c%243%7d&o=m) (click on "Context") but that probably won't help you (: – bobble bubble Mar 10 '18 at 21:58
  • Did the answer below work? If not, I may be able to improve it, or remove it if it wasn't helpful. – revo Mar 14 '18 at 15:09
  • @revo Sorry, busy work week. While your solution doesn't fulfill 100% of the reqs, it absolutely helps accomplish what I was looking for: pure regex. I see the positive lookahead and lookbehind, and it respects group boundaries, which is good since there's not always an equal number of elements per group. I also found [this question](https://stackoverflow.com/questions/15268504/collapse-and-capture-a-repeating-pattern-in-a-single-regex-expression) which leads to [this pattern](https://regex101.com/r/tA5xK0/1) as an example. I'll have more time to work on it soon. Please be patient with me! – rumpled Mar 15 '18 at 01:26

1 Answer


TL;DR

Step 1

Search for:

("[^"]+"):\s*{[^}]*},?\K

Replace with \1


Step 2

Search for:

(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)

Replace with:

{\3,\1,\2}\4\n


Whole philosophy

This can't be done with a single regex, for several reasons. The most important is that in PCRE we can neither store part of a match to refer to in later matches nor use unbounded-length lookbehinds. Fortunately, most problems of this kind can be solved in two steps.
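To make the capture limitation concrete, here is a minimal sketch in Python. Python's `re` module is an assumption chosen for illustration, not the PCRE flavour used in this answer; the `(?'name'...)` groups from the question are rewritten as `(?P<name>...)`, and `sample` is a shortened copy of the question's data. A capturing group inside a repeated group only keeps its last iteration, so the earlier keys and values are not available to a replacement:

```python
import re

# Shortened copy of the question's first block (assumed input for this sketch).
sample = '"g1": {\n  "k1": "v1",\n  "k2": "v2",\n  "k3": "v3"\n},'

# The question's pattern, with the named groups in Python syntax.
pattern = re.compile(
    r'"(?P<group>\w+)": \{\n'
    r'(?:\s*"(?P<key>[^"]+)": "(?P<value>[^"]+)"(?:,\n)?)+'
    r'\n\},?'
)

match = pattern.search(sample)
# The repeated group ran three times, but only the final iteration is kept.
captured = (match.group("group"), match.group("key"), match.group("value"))
print(captured)  # ('g1', 'k3', 'v3')
```

The whole block matches, yet `k1`/`v1` and `k2`/`v2` are gone by the time the match is reported.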

The first step is to copy the group name to the end of its {...} block. That way the group name is available every time we turn a match into a single line of output.

("[^"]+"):\s*{[^}]*},?\K
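  • ( Start of capturing group #1
    • "[^"]+" Match a group name
  • ) End of CG #1
  • :\s*{ The group name must be followed by a colon, optional whitespace and an opening brace
  • [^}]*},? Consume everything up to the end of the block, plus an optional trailing comma
  • \K Reset the start of the match, so everything matched so far is left out of the replacement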

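The group name is held in the first capturing group. Because of \K, the reported match is empty and sits just after the block, so replacing it with the group name inserts the name at that position: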

\1

Now a block like this:

"g1": {
  .
  .
  .
},

Becomes this:

"g1": {
  .
  .
  .
},"g1"
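As a rough check of this first step, the same substitution can be sketched in Python. This is an illustration under assumptions: Python's `re` has no `\K`, so the replacement re-emits the whole match with `\g<0>` before appending `\1`, and the `text` sample is made up for the example.

```python
import re

# Made-up two-block sample for this sketch.
text = '"g1": {\n  "k1": "v1",\n  "k2": "v2"\n},\n"g2": {\n  "k4": "v4"\n}'

# \g<0> puts back everything that was matched (standing in for \K),
# then \1 appends the captured group name right after the block.
step1 = re.sub(r'("[^"]+"):\s*\{[^}]*\},?', r'\g<0>\1', text)
print(step1)
```

Each block should now end with its own group name, for example `},"g1"` after the first one.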

The next step is to match each key:value pair of a block while also capturing the group name that was just appended to the end of that block.

(?:"[^"]+":\s*{|\G(?!\A))\s*("[^"]+"):\s*((?1))(?=[^}]*},?((?1)))(?|(,)|\s*}(,?).*\R*)
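  • (?: Start of a non-capturing group
    • "[^"]+" Try to match a group name
    • :\s*{ The group name must be followed by a colon, optional whitespace and an opening brace
    • | Or
    • \G(?!\A) Continue from where the previous match ended (but not at the very start of the subject)
  • ) End of NCG
  • \s*("[^"]+"):\s*((?1)) Then match and capture a key:value pair; (?1) reuses the pattern of CG #1
  • (?=[^}]*},?((?1))) Look ahead, without consuming anything, to capture the group name appended at the end of the block
  • (?|(,)|\s*}(,?).*\R*) Branch reset group: match the remaining characters (a comma, or the closing brace, the appended group name and newlines); both branches store their comma in CG #4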

This way, every successful match gives us four captures, and the order in which we emit them is what matters:

{\3,\1,\2}\4\n
  • \3 Group name (the one appended to the end of the block)
  • \1 Key
  • \2 Value
  • \4 Comma (may or may not be present)
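For comparison, outside a text editor the same flattening can be sketched in Python. This is not the PCRE approach above: Python's built-in `re` does not support `\G`, `\K`, subroutine calls like `(?1)` or branch reset groups, so the sketch swaps in nested matching (one pattern per block, a second pattern per pair). The names `BLOCK`, `PAIR` and `flatten` are hypothetical, and the input is assumed to look like the question's sample (no nested braces, no escaped quotes).

```python
import re

# One block: a quoted group name followed by a {...} body without nested braces.
BLOCK = re.compile(r'("[^"]+"):\s*\{([^}]*)\}')
# One quoted key:value pair inside a block body.
PAIR = re.compile(r'("[^"]+"):\s*("[^"]+")')

def flatten(text):
    """Return one {group,key,value} row per pair, joined by ',\\n'."""
    rows = []
    for block in BLOCK.finditer(text):
        group_name, body = block.group(1), block.group(2)
        # Blocks may hold any number of pairs; findall returns all of them.
        for key, value in PAIR.findall(body):
            rows.append("{%s,%s,%s}" % (group_name, key, value))
    return ",\n".join(rows)

source = (
    '"g1": {\n  "k1": "v1",\n  "k2": "v2",\n  "k3": "v3"\n},\n'
    '"g2": {\n  "k4": "v4"\n}'
)
print(flatten(source))
```

The outer loop plays the role of the appended group name from step 1: the name stays in scope while every pair of its block is emitted, regardless of how many pairs the block holds.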
revo