Using Regex in Python to get array sizes

Question

I'm very, very new to regular expressions. I just picked them up about 3 hours ago, but I hit a snag, and I can't seem to shake it. So, as always, I turned to the internet to solve all my problems, and when it couldn't explain the answers to me, I searched on Stack Overflow to see if someone else had asked my question, and finally just posted a new question when I couldn't answer it myself from browsing.

I'll dumb down what I'm trying to do a little bit because I've figured most of it out, but this one teeny weeny bit of it just isn't working the way I want it to, or at all actually, and the whole mess is complicated and hard to explain, but in the end, I have a whole bunch of strings I want to run a regex on.

So, inside a loop, I pass a string which contains a variable name. Now, I'm having a hard time explaining what the variables may look like, so I am just going to list some examples, each followed by a pipe, followed by what I want to extract.

Variable | (Variable)
Variable.list[3]name | (Variable.list[3]name)
Var.list[5] | (Var.list , 5)
Var.list_name[3]thing_words[4][3][2] | (Var.list_name[3]thing_words , 4 , 3 , 2)
Var[3] | (Var , 3)
Var.word | (Var.word)

And so on. I think that makes it clear, right? I want the variable name, which may or may not contain brackets, and if there are any trailing brackets, I want to exclude them from the name and capture them so I can access them from match.groups(). I don't think there are any variables with a name that ends with ...[] without a number inside, but there may be, and if there are, I want to ignore those too.

Right now I am trying to do something like:

for line in list:
    regex = re.compile("^[-\w\[\]\.]+(\[(0-9)*]\])*$")
    match = regex.match(line)
    if match:
        do something that depends on len( match.groups() )

But... it's not working. The regex never matches, even when I think it should.

In my mind, I am being very clear! I want it to start with a bunch of stuff and potentially end with a bunch of bracketed numbers, and if it ends with bracketed numbers, to catch them and store them, but ignore any bracketed numbers that are NOT at the end of the string.

So... now that I have thoroughly overexplained myself to the point of being a little redundant... what do I do to make it work as I want? Can it even be done the way I am trying to do it? Should I instead do something more like:

while match.endswith("]"):
    match.strip("]")
    func()
    match.strip("[")

where the func() does a regex to strip the number off the end? That seems overly complicated, and very messy. My gut tells me regex can handle it, and that my novice eyes just can't see how.

What do you need this for? The best solution may involve a change somewhere else. – user2357112, Aug 07 '13 at 00:40
It looks like a good use case for a parser instead of regex. Your input example is not far from an EBNF grammar. – Paulo Scardine, Aug 07 '13 at 00:41
Ah, I can't say what it's for, unfortunately, and I can't change what I am being passed. I'm sorry. Parser... I'll read up on that, but could you elaborate? I am also only weeks into Python in general. – CamelopardalisRex, Aug 07 '13 at 00:41
A parser is a program that will read the input and spit out what you want. Most parser libraries will consume a grammar (a normalized description of the file format) and return a parser, but you can also write your tokenizer/parser by hand. – Paulo Scardine, Aug 07 '13 at 00:47
I just read http://docs.python.org/2/library/parser.html and I think this is more along the lines of my second method, yes? I would use this parser instead of the func() or rather as the func() I wrote at the end, yes? – CamelopardalisRex, Aug 07 '13 at 00:48
That one is for parsing Python code; you are looking for http://wiki.python.org/moin/LanguageParsing – Paulo Scardine, Aug 07 '13 at 00:49
Oh, well, I'm not sure what this is. I'll have to learn about this too. Which parser would you recommend for my particular query, so I can delve into that one specifically? – CamelopardalisRex, Aug 07 '13 at 00:54
Unfortunately I lack the time to write a proper answer and commenting further would be an abuse. If you are writing something that is meant to endure, taking a couple days to study about [EBNF](https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form) and LR/LL parsers will pay off in the future; if you just want a quick-and-dirty solution for a one-time thing, keep hacking regular expressions. Also, if it is a known format, perhaps somebody already wrote a grammar or parser for you. – Paulo Scardine, Aug 07 '13 at 01:07

llb · Accepted Answer · 2013-08-07T02:59:56.993

This problem is a little more complicated than I realized, because the re module keeps only the last match of a repeated capturing group, so you'll have to do some manual work to separate the numbers. First, use one regular expression to divide the string in the right place; then use another to find all the numbers.

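import re

def get_variable_and_sizes(var_string):
    # Split into a lazily matched name and the run of trailing [digits] groups.
    result = re.search(r'(.*?)((?:\[\d*])*)$', var_string)
    var_name = result.group(1)
    # Pull each number out of the trailing run; empty brackets are skipped.
    numbers = re.findall(r'\[(\d+)]', result.group(2))
    return [var_name] + numbers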

What we're doing here is breaking the problem in two parts. The first regular expression has two capturing groups: the first catches any number of characters (non-greedily), the second catches any number of repetitions of bracketed digits, as one unit (as noted, we can't repeat capturing groups, but we can repeat groups WITHIN a capturing group).
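To make the point about repeated groups concrete, here is a small sketch. The sample string and the variable names are made up for illustration, and the simplified `\w+` name pattern is an assumption that only suits this one sample.

```python
import re

sample = "Var[4][3][2]"

# A capturing group that is itself repeated: only its last repetition is kept.
repeated = re.match(r"(\w+)(?:\[(\d+)\])*$", sample)
print(repeated.groups())  # ('Var', '2')

# Repeating a non-capturing group INSIDE one capturing group keeps the whole run.
wrapped = re.match(r"(\w+)((?:\[\d+\])*)$", sample)
print(wrapped.groups())  # ('Var', '[4][3][2]')
```

The second form is the shape used in the first regular expression above: the trailing run comes back as a single string, which still has to be split into individual numbers.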

The first group in the match we get back is the variable name. The second group needs to be parsed further to identify all the numbers in it. Fortunately, it's easy to write a regular expression that captures a number inside brackets, and then use findall to get a list of all the matches in the second group. If there are no such matches, we get an empty list.

Finally, we make a list containing the variable name, concatenate the list we got back from the second regex, and return it.
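As a quick check, the function can be run against the examples from the question. This is only a sketch: the function is restated so the snippet runs on its own, and the `examples` list and `results` dictionary are names introduced here for illustration. The last entry, `Var[]`, is an assumed input for the "empty brackets" case the question mentions.

```python
import re

def get_variable_and_sizes(var_string):
    # Same two-step approach as above, repeated so this snippet is standalone.
    result = re.search(r'(.*?)((?:\[\d*])*)$', var_string)
    numbers = re.findall(r'\[(\d+)]', result.group(2))
    return [result.group(1)] + numbers

examples = [
    "Variable",
    "Variable.list[3]name",
    "Var.list[5]",
    "Var.list_name[3]thing_words[4][3][2]",
    "Var[3]",
    "Var.word",
    "Var[]",  # assumed input: trailing brackets with no number inside
]

results = {text: get_variable_and_sizes(text) for text in examples}
for text, parsed in results.items():
    print(text, "->", parsed)
```

Note that the sizes come back as strings; something like `[int(n) for n in parsed[1:]]` would convert them if integers are needed.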

Well, this doesn't quite work. Giving it var[2][2][2] has it return ('var','2') and not ('var','2','2','2') but this is definitely a step in the right direction! Thanks! – CamelopardalisRex, Aug 07 '13 at 01:01
Okay, I edited and tested this out on your inputs. It seems to work. – llb, Aug 07 '13 at 01:29
Hey hey! Here we go! Dankeschön! This does it perfectly! I need to study this code though, because I don't understand why it works, but it certainly performs as desired! Thanks a bundle mate! – CamelopardalisRex, Aug 07 '13 at 02:27

score 1 · Answer 2 · answered Aug 07 '13 at 01:09

I don't think you can have a variable number of capturing groups. If you repeat a capturing group, only the last value it matched is kept. A workaround, if you know the maximum number of square brackets you will have at the end, is to repeat the bracket group in your regex that many times:

^[\w.]+(?:\[\d+\][\w.]+)*(?:\[(\d+)\])?(?:\[(\d+)\])?(?:\[(\d+)\])?(?:\[(\d+)\])?$

This regex would capture up to 4 square bracket groups at the end of your string.
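A minimal Python sketch of this fixed-count idea follows. It is a variant, not the exact pattern above: it assumes names consist of word characters and dots, allows multi-digit indices, and adds a capturing group for the name. `MAX_INDICES` and `split_fixed` are hypothetical names chosen for the example.

```python
import re

# Hypothetical limit: at most four trailing indices are recognised.
MAX_INDICES = 4

# Name part (which may contain bracketed numbers followed by more name text),
# then up to MAX_INDICES optional capturing groups for the trailing indices.
pattern = re.compile(
    r"^([\w.]+(?:\[\d+\][\w.]+)*)" + r"(?:\[(\d+)\])?" * MAX_INDICES + r"$"
)

def split_fixed(var_string):
    match = pattern.match(var_string)
    if match is None:
        return None
    name, *indices = match.groups()
    # Optional groups that did not take part in the match are None; drop them.
    return [name] + [i for i in indices if i is not None]

print(split_fixed("Var.list_name[3]thing_words[4][3][2]"))
print(split_fixed("Var[1][2][3][4][5]"))  # five indices exceed the limit
```

The second call shows the cost of this approach: a string with more trailing indices than the pattern allows does not match at all, so the limit has to be chosen up front.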

Other than that, I think a parser would be your best option.
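A parser here does not have to mean a full grammar. As a rough sketch of a hand-written alternative that uses no regular expressions, the trailing brackets can be peeled off from the right. The function name `split_trailing_indices` is hypothetical, and the handling of empty or non-numeric brackets is an assumption about what the input may contain.

```python
def split_trailing_indices(var_string):
    """Peel bracketed numbers off the end of the string, right to left."""
    name = var_string
    indices = []
    while name.endswith("]"):
        open_pos = name.rfind("[")
        if open_pos == -1:
            break  # unbalanced "]": leave it as part of the name
        inner = name[open_pos + 1:-1]
        if inner and not inner.isdigit():
            break  # not a numeric index, so it belongs to the name
        if inner:
            indices.append(inner)
        # Empty brackets are dropped from the name without being recorded.
        name = name[:open_pos]
    indices.reverse()  # collected right to left, so restore reading order
    return [name] + indices

print(split_trailing_indices("Var.list_name[3]thing_words[4][3][2]"))
print(split_trailing_indices("Variable.list[3]name"))
```

Unlike the fixed-count regex, this loop places no upper bound on the number of trailing indices.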

Jaime Morales


Is that so? So it really is impossible with my method. Alright, that out and out answers my query. Thank you. – CamelopardalisRex Aug 07 '13 at 01:13
Yeah. Pretty sure. This thread addresses a similar issue http://stackoverflow.com/questions/5018487/regular-expression-with-variable-number-of-groups – Jaime Morales Aug 07 '13 at 01:23
It might be impossible to do with one regular expression, because Python doesn't let you repeat capturing groups, but I think it's solvable with two. – llb Aug 07 '13 at 01:30
