I have an input that looks like this:

<ID>0<VAL>a1b<ID>1<VAL>a2b<ID>2<VAL>a3b<ID>3<VAL>a4b

I need to capture key-value pairs (id, val), or at least a flat array of groups like the following: [0, a1b, 1, a2b, 2, a3b, 3, a4b]

Capturing just one pair (i.e. when the input contains only a single pair) works with this:

(?>(?:<ID>(\d+))(?:<VAL>(.+)))?

the result being: [0, a1b].

But it doesn't work for multiple pairs: it captures 0 as the first group, and the second group then takes the rest of the input after the first <VAL> tag, as in: [0, a1b<ID>1<VAL>a2b<ID>2<VAL>a3b<ID>3<VAL>a4b]

Can someone point me in the right direction?

UPDATE: what if <ID> and <VAL> are some special chars, for example 0x8F and 0x9F?

Mike Spike
  • Use multiple matching with `<ID>(\d+)<VAL>([^<]+)` - https://regex101.com/r/6jWv6t/1 – Wiktor Stribiżew Jan 02 '23 at 10:46
  • If you test [your pattern on e.g. regex101](https://regex101.com/r/LMCuPG/1) you'll see that the [*greedy*](https://stackoverflow.com/questions/2301285/what-do-lazy-and-greedy-mean-in-the-context-of-regular-expressions) `.+` consumes everything from the first value to the end of the input. Maybe you can use the `\w` word-character class for the `VAL`. With the optional group you could use [`<ID>(\d+)(?:<VAL>(\w+))?`](https://regex101.com/r/owgong/1) – bobble bubble Jan 02 '23 at 12:31
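
The multiple-matching idea from the comments can be sketched as follows. Python is an assumption here (the question does not name a language), and the literal `<ID>` and `<VAL>` tags are written into the pattern so that each match covers exactly one pair. The first pattern reproduces the greedy behaviour from the question for comparison.

```python
import re

text = "<ID>0<VAL>a1b<ID>1<VAL>a2b<ID>2<VAL>a3b<ID>3<VAL>a4b"

# Greedy .+ runs to the end of the input, so only one pair comes back.
greedy = re.findall(r"<ID>(\d+)<VAL>(.+)", text)

# [^<]+ stops at the next "<", so each match covers exactly one pair.
pairs = re.findall(r"<ID>(\d+)<VAL>([^<]+)", text)

# Flatten to the [id, val, id, val, ...] shape asked for in the question.
flat = [item for pair in pairs for item in pair]

print(greedy)
print(pairs)
print(flat)
```

Note that `[^<]+` cuts a value short if the value itself may contain `<`. In that case a lazy `.+?` followed by a lookahead for the next tag or the end of input is one possible alternative.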

2 Answers

This regex matches keys and then values.

(?<=<ID>)(\d+)(?=<VAL>)|(?<=<VAL>)[a-z\d]*(?=<ID>)

There are 2 alternatives, separated by |:

  • (?<=<ID>)(\d+)(?=<VAL>) matches a key \d+ between <ID> and <VAL> using positive lookbehind and lookahead
    • (?<=<ID>) is a positive lookbehind
    • (?=<VAL>) is a positive lookahead
  • (?<=<VAL>)[a-z\d]*(?=<ID>) matches a value between <VAL> and <ID> using positive lookbehind and lookahead
    • [a-z\d]* matches a value
    • (?<=<VAL>) is a positive lookbehind
    • (?=<ID>) is a positive lookahead
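
To see how the lookaround pattern behaves on the sample input, here is a minimal Python sketch (Python is an assumption; the answer does not name an engine). It uses `finditer` and reads the whole match, because the value alternative has no capture group. The second pattern is a hypothetical variant, not part of the answer, that also accepts the end of the input after a value.

```python
import re

text = "<ID>0<VAL>a1b<ID>1<VAL>a2b<ID>2<VAL>a3b<ID>3<VAL>a4b"

# The pattern from this answer: keys and values come back as separate matches.
answer = re.compile(r"(?<=<ID>)(\d+)(?=<VAL>)|(?<=<VAL>)[a-z\d]*(?=<ID>)")
found = [m.group(0) for m in answer.finditer(text)]

# Hypothetical variant: a value may also be followed by the end of the input.
variant = re.compile(r"(?<=<ID>)\d+(?=<VAL>)|(?<=<VAL>)[a-z\d]*(?=<ID>|$)")
found_all = [m.group(0) for m in variant.finditer(text)]

print(found)
print(found_all)
```

With the original pattern the last value is missing, because no `<ID>` follows it. The variant returns all eight items.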

regex101.com

Albina
  • Thank you for your answer. The current regex doesn't put the values in a capture group, and it also doesn't match the last value. A solution to that would be `(?<=<ID>)(\d+)(?=<VAL>)|(?<=<VAL>)([a-z\d]*)(?=<ID>)?`: the added `?` makes the last lookahead optional, and the value `[a-z\d]*` now has its own capture group. – Mike Spike Jan 11 '23 at 13:19
  • For hex delimiters, the regex would look similar to `(?<=\x8A)(\d+)(?=\x9A)|(?<=\x9A)([a-z\d]*)(?=\x8A?)`. Also, as it stands the value class is quite restrictive, so if more complex values need to be captured (e.g. values that include '.' or ':'), just widen the value capture: `([a-zA-Z\d\.\:\s]*)` – Mike Spike Jan 11 '23 at 14:15
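
For the special-character delimiters from the question's update, a minimal Python sketch could look like this. It assumes the input is a `str` in which 0x8F and 0x9F appear as single code points; the sample records and the names `ID_SEP`, `VAL_SEP` and `PAIR` are made up for illustration. Excluding both delimiters from the value class avoids having to list the allowed value characters.

```python
import re

ID_SEP, VAL_SEP = "\x8f", "\x9f"  # delimiters from the question's update

# Hypothetical sample data, including values with '.', ':' and spaces.
records = [("0", "a1b"), ("1", "a.2:b"), ("2", "a 3 b")]
text = "".join(f"{ID_SEP}{key}{VAL_SEP}{val}" for key, val in records)

# A value is any run of characters that is not one of the two delimiters.
PAIR = re.compile(r"\x8f(\d+)\x9f([^\x8f\x9f]*)")
pairs = PAIR.findall(text)

print(pairs)
```

If the input arrives as raw bytes instead, the same pattern would need to be a bytes pattern (`rb"..."`) applied to a `bytes` object.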

@bobble-bubble's solution is the most efficient (according to regex101): 4 matches in 72 steps and 1ms, but it's very restrictive. To fix this, the \w can be replaced with [a-z\d], and then it becomes even faster: 4 matches in 72 steps and 0ms.

@WiktorStribiżew's solution is the next most efficient: 4 matches in 64 steps and 4ms.

@albina's solution is the least efficient: 7 matches in 153 steps and 10ms.

Mike Spike