I am having trouble extracting a substring from a string with a Python regex. I want to pull any substring of the following form out of a larger string:

some_var:struct<some_variables>

While doing so I ran into three corner cases, which I describe in detail below.

Scenario 1:

>>> import re
>>> s = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,addr:string'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*>,', s)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,

The above code works fine.

Scenario 2:

>>> subdtyp = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,last3:array<string>,last4:struct<last41:int,last42:string>'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*>,', subdtyp)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,last3:array<string>,

In this case, because of greedy matching, the same regex returns more than expected: (last3:array<string>,) is the extra piece. So I changed it to non-greedy matching, as below:

>>> match = re.search(r'\w[a-zA-Z]*:struct<.*?>,', subdtyp)
>>> print(match.group())
lastname:struct<last1:int,last2:array<string>>,

This time the result is what I want.

Scenario 3:

>>> subdtyp2 = 'firstname:string,middlename:double,lastname:struct<last4:struct<last41:int,last42:string>,last2:array<string>>,last3:array<string>'
>>> match = re.search(r'\w[a-zA-Z]*:struct<.*?>,', subdtyp2)
>>> print(match.group())
lastname:struct<last4:struct<last41:int,last42:string>,

Here the result is incomplete: non-greedy matching stops at the first ">,", so the (last2:array<string>>) portion of the outer struct is cut off.

Can somebody please help me with a regex that satisfies all of the above cases?

Sam Mason
According to [this answer](https://stackoverflow.com/a/5454510/16521194), regex would not be the best way to handle nested expressions. A better way would be `pyparsing`. – GregoirePelegrin Dec 13 '22 at 07:26

1 Answer


Starting from this answer, I get something like this:

import pyparsing

string = 'firstname:string,middlename:double,lastname:struct<last1:int,last2:array<string>>,addr:string'
# Tokens allowed between the brackets: names, ':' and ','
# (alphanums has no underscore; extend it if your names contain '_')
thecontent = pyparsing.Word(pyparsing.alphanums) | ":" | ","
parens = pyparsing.nestedExpr("<", ">", content=thecontent)

# nestedExpr has to start at an opening bracket, so wrap the input in <...>
a = parens.parseString(f"<{string}>").asList()[0]
print(a[a.index('struct')+1])

# ['last1', ':', 'int', ',', 'last2', ':', 'array', ['string']]

thecontent must match everything other than the nesting characters, while parens handles the nesting characters themselves. Additionally, as in JSON, the parsed text cannot start with anything other than a nesting character, which is why the input is wrapped as f"<{string}>" before parsing.
As far as I understand, you want the contents of the structs, and this should let you get exactly that.
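
If adding a dependency is not an option, the same nesting-aware idea can be sketched with the standard library alone: use a regex only to find the `name:struct<` header, then count brackets to find the matching `>`. This is a minimal sketch, not a full parser. The function name `extract_structs` is made up for illustration, and it assumes that `<` and `>` appear only as nesting brackets and that names consist of `\w` characters.

```python
import re

# Hypothetical helper: the name and behaviour are illustrative assumptions.
def extract_structs(text):
    """Return each outermost `name:struct<...>` substring with balanced brackets."""
    header = re.compile(r'\w+:struct<')
    results = []
    pos = 0
    while True:
        m = header.search(text, pos)
        if m is None:
            break
        depth = 1          # the header already consumed one '<'
        i = m.end()
        while i < len(text) and depth > 0:
            if text[i] == '<':
                depth += 1
            elif text[i] == '>':
                depth -= 1
            i += 1
        if depth != 0:
            break          # unbalanced input: stop rather than guess
        results.append(text[m.start():i])
        pos = i            # continue after this struct, skipping nested ones
    return results


subdtyp2 = ('firstname:string,middlename:double,'
            'lastname:struct<last4:struct<last41:int,last42:string>,'
            'last2:array<string>>,last3:array<string>')
print(extract_structs(subdtyp2))
```

Unlike the regexes in the question, this does not include the trailing comma in the result, and a struct nested inside another struct is returned only as part of its parent.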

GregoirePelegrin