I am trying to parse a query which I need to modify to replace a specific property and its value with another property and different values. I am struggling to write a regex that will match the specific property and its value that I need.

Here are some examples to illustrate my point. test:property is the property name that we need to match.

  1. Property with a single value: test:property:schema:Person
  2. Property with multiple values (there is no limit on how many values there can be - this example uses 3): test:property:(schema:Person OR schema:Organization OR schema:Place)
  3. Property with a single value in brackets: test:property:(schema:Person)
  4. Property with another property in the query string (i.e. there are other parts of the string that I'm not interested in): test:property:schema:Person test:otherProperty:anotherValue

Also note that other combinations are possible, such as other properties appearing before the property I need to capture, or my property having multiple values with another property present in the query.

I want to match on the entire test:property section with each value captured within that match. Given the examples above these are the results I am looking for:

| # | Match | Groups |
|---|-------|--------|
| 1 | test:property:schema:Person | schema:Person |
| 2 | test:property:(schema:Person OR schema:Organization OR schema:Place) | schema:Person, schema:Organization, schema:Place |
| 3 | test:property:(schema:Person) | schema:Person |
| 4 | test:property:schema:Person | schema:Person |

Note: #1 and #4 produce the same output. I wanted to illustrate that the rest of the string should be ignored (I only need to change the test:property key and value).

The pattern of schema:Person is defined as \w+\:\w+, i.e. one or more word characters, followed by a colon, followed by one or more word characters.

If we define the known parts of the string with names I think I can express what I want to match.

  • schema:Person - <TypeName> - note that the first part, schema in this case, is not fixed and can be different
  • test:property - <MatchProperty>
<MatchProperty>: // property name (which is known and the same - in the examples this is `test:property`) followed by a colon
  ( // optional open bracket
    <TypeName>
    (OR <TypeName>)* // optional additional TypeNames separated by an OR
  ) // optional close bracket

Every example I've found has had simple alphanumeric characters in the repeating section but my repeating pattern contains the colon which seems to be tripping me up. The closest I've got is this:

(test\:property:(?:\(([\w+\:\w+]+ [OR [\w+\:\w+]+)\))|[\w+\:\w+]+)

Which works okayish when there are no other properties (although the match for example #2 contains the entire property and value as the first group result, and a second group with the property value) but goes crazy when other properties are included.

Also, putting that regex through https://regex101.com/ I know it's not right as the backslash characters in the square brackets are being matched exactly. I started to have a go with capturing and non-capturing groups but got as far as this before giving up!

(?:(\w+\:\w+))(?:(\sOR\s))*(?:(\w+\:\w+))*
Stuart Leyland-Cole
  • Are there at most 3 “groups”? – Bohemian Sep 14 '21 at 17:37
  • So your `<MatchProperty>` isn't always `test:property`? And the first part of your `<TypeName>` isn't always `schema`? Can you add a specific language please? .NET for example can capture more than just the last occurrence in a capturing group. – Scratte Sep 14 '21 at 17:40
  • Test case #1 and #4 are the same!? – Bohemian Sep 14 '21 at 17:43
  • The number of groups is set within the pattern. I suggest capturing the whole part after `test:property` and then splitting the captured value with space + `OR` + space. See `\btest:property:(\()?(\w+:\w+(?:\s+OR\s+\w+:\w+)*)(?(1)\))` [regex demo](https://regex101.com/r/7pTnHG/2) – Wiktor Stribiżew Sep 14 '21 at 18:05
  • I have updated the question to answer all of the questions raised in these comments, I hope that adds some clarity. Briefly: there are an unlimited number of groups, `<MatchProperty>` is a known value, the first part of `<TypeName>` is not always `schema`, and I've clarified the reason for examples #1 and #4 – Stuart Leyland-Cole Sep 14 '21 at 22:28

1 Answer

This isn't a complete solution if you want pure regex because there are some limitations to regex and Java regex in particular, but the regexes I came up with seem to work.

If you're looking to match the entire sequence, the following regex will work.

test:property:(?:\((\w+:\w+)(?:\sOR\s(\w+:\w+))*\)|(\w+:\w+))

Unfortunately, the repeated capture groups will only capture the last match, so in queries with multiple values (like example 2), groups 1 and 2 will be the first and last values (schema:Person and schema:Place). In queries without parentheses, the value will be in group 3.
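
To see the "last match only" behaviour for yourself, here is a small Java sketch. I'm assuming Java because that seems to be the target; the class and method names (`RepeatedGroupDemo`, `groups`) are made up for illustration. It runs the regex above against example 2 and prints the three groups.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the question's code.
class RepeatedGroupDemo {
    // The full-match regex from above. Groups 1 and 2 belong to the
    // bracketed branch, group 3 is the bare single value.
    static final Pattern FULL = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+)(?:\\sOR\\s(\\w+:\\w+))*\\)|(\\w+:\\w+))");

    // Returns groups 1..3 of the first match, or null if nothing matches.
    static String[] groups(String query) {
        Matcher m = FULL.matcher(query);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String query =
            "test:property:(schema:Person OR schema:Organization OR schema:Place)";
        // Group 2 sits inside the repeated (...)* part, so it only keeps
        // the value from the final iteration; schema:Organization is lost.
        System.out.println(Arrays.toString(groups(query)));
        // prints [schema:Person, schema:Place, null]
    }
}
```

Groups that did not take part in the match come back as `null`, which is why group 3 is `null` here and group 2 would be `null` for example 3.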

If you know the maximum number of values, you could just generate a massive regex that will have enough groups, but this might not be ideal depending on your application.

The other regex to find values in groups of arbitrary length uses regex's positive lookbehind to match valid values. You can then generate an array of matches.

(?<=test:property:(?:(?:\((?:\w+:\w+\sOR\s)+)|\(?))\w+:\w+

The issue with this method is that Java's lookbehind appears to have some limitations, specifically that it does not allow unbounded or complex quantifiers. I'm not a Java person so I haven't tried things out for myself, but it seems like this wouldn't work either. If someone else has another solution, please post another answer!

With this in mind, I would probably suggest going with a combination regex + string parsing method. You can use regex to parse out the value or multiple values (separated by OR), then split the string to get your final values.

To match the entire part inside the parentheses, or the single value without parentheses, you can use this regex:

test:property:(?:\((\w+:\w+(?:\sOR\s\w+:\w+)*)\)|(\w+:\w+))

It's still split into two groups where one matches values with parentheses and the other matches values without (to avoid matching unpaired parentheses), but it should be usable.
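
A minimal Java sketch of the regex + split idea, assuming Java is the target. The names `PropertyValues` and `extract` are hypothetical. It takes whichever of the two groups matched and splits it on the ` OR ` separator.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
class PropertyValues {
    // Group 1: everything between the brackets; group 2: a bare single value.
    static final Pattern PROPERTY = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+(?:\\sOR\\s\\w+:\\w+)*)\\)|(\\w+:\\w+))");

    // Returns the values of the first test:property occurrence in the query,
    // or an empty list if the property is not present.
    static List<String> extract(String query) {
        Matcher m = PROPERTY.matcher(query);
        if (!m.find()) {
            return Collections.emptyList();
        }
        // Exactly one of the two groups is non-null after a successful match.
        String raw = m.group(1) != null ? m.group(1) : m.group(2);
        return Arrays.asList(raw.split("\\sOR\\s"));
    }

    public static void main(String[] args) {
        System.out.println(extract(
            "test:property:(schema:Person OR schema:Organization OR schema:Place)"));
        // prints [schema:Person, schema:Organization, schema:Place]
        System.out.println(extract(
            "test:property:schema:Person test:otherProperty:anotherValue"));
        // prints [schema:Person]
    }
}
```

`m.group(0)` (or `m.group()`) still gives you the whole `test:property:...` section, and `m.start()` / `m.end()` give its position if you need to cut it out of the query.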

If you want to play around with these regexes or learn more, here's a regexr: https://regexr.com/65kma
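
Since the end goal in the question is to swap the property and its values for something else, here is a rough Java sketch of how the same capture-and-split regex could drive the replacement. Everything about the target format is an assumption on my part: `PropertyRewriter`, `rewrite`, the `newProperty` name and the `mapValue` function are all hypothetical, and I've chosen to always write the new values in brackets.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper class for illustration only.
class PropertyRewriter {
    // Same regex as above: group 1 is the bracketed list, group 2 a bare value.
    static final Pattern PROPERTY = Pattern.compile(
        "test:property:(?:\\((\\w+:\\w+(?:\\sOR\\s\\w+:\\w+)*)\\)|(\\w+:\\w+))");

    // Replaces every test:property section with newProperty:(v1 OR v2 ...),
    // where each value has been passed through mapValue. The rest of the
    // query is copied over unchanged.
    static String rewrite(String query, String newProperty,
                          UnaryOperator<String> mapValue) {
        Matcher m = PROPERTY.matcher(query);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String raw = m.group(1) != null ? m.group(1) : m.group(2);
            String values = Arrays.stream(raw.split("\\sOR\\s"))
                .map(mapValue)
                .collect(Collectors.joining(" OR "));
            // quoteReplacement stops $ and \ in the new text being
            // interpreted as group references.
            m.appendReplacement(out,
                Matcher.quoteReplacement(newProperty + ":(" + values + ")"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
            "test:property:schema:Person test:otherProperty:anotherValue",
            "new:property",
            v -> v.replace("schema:", "other:")));
        // prints new:property:(other:Person) test:otherProperty:anotherValue
    }
}
```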

person_v1.32