More elegant way to implement regexp-like quantifiers

Question

I'm writing a simple string parser which allows regexp-like quantifiers. An input string might look like this:

s = "x y{1,2} z"

My parser function translates this string to a list of tuples:

list_of_tuples = [("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]

Now, the tricky bit is that I need a list of all valid combinations that are specified by the quantification. The combinations all have to have the same number of elements, and the value None is used for padding. For the given example, the expected output is

[["x", "y", None, "z"], ["x", "y", "y", "z"]]

I do have a working solution, but I'm not really happy with it: it uses two nested for loops, and I find the code somewhat obscure, so there's something generally awkward and clumsy about it:

import itertools

def permute_input(lot):
    outer = []
    # is there something that replaces these nested loops?
    for val, start, end in lot:
        inner = []
        # For each tuple, create a list of constant length
        # Each element contains a different number of 
        # repetitions of the value of the tuple, padded
        # by the value None if needed.
        for i in range(start, end + 1):
            x = [val] * i + [None] * (end - i)
            inner.append(x)
        outer.append(inner)
    # Outer is now a list of lists.

    final = []
    # use itertools.product to combine the elements in the
    # list of lists:
    for combination in itertools.product(*outer):
        # flatten the elements in the current combination,
        # and append them to the final list:
        final.append([x for x 
                    in itertools.chain.from_iterable(combination)])
    return final

print(permute_input([("x", 1, 1), ("y", 1, 2), ("z", 1, 1)]))
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

I suspect that there's a much more elegant way of doing this, possibly hidden somewhere in the itertools module?

Can you do something wth one of these answers? They use different approaches, so I guess it comes down to what works best for you. (Although I am free to pick one to award the bonus to, I'd appreciate your input. At the moment I'm still undecided.) — Jongware, Jan 29 '17 at 23:40
It's a tough choice. @Josay's list comprehension is straightforward and very concise, but perhaps less legible than what I was hoping for. Using generators as in @Marat's answer is of course highly pythonic, but the code is even more complex. @alecxe's ``pyparsing`` answer introduces an additional dependency and requires a bunch of new objects. However, that module appears to be so flexible that I think about replacing by homegrown parser by it. — Schmuddi, Jan 30 '17 at 17:24
I see you accepted one of the answers. Do you mind if I award the bonus to another one? :) By the way, I'm really impressed that your question resulted in three different, yet all efficient approaches. — Jongware, Jan 30 '17 at 18:20
@RadLexus: Sure, go ahead. ;) Like you, I'm also surprised by how different the answers are. Thanks for encouraging them by offering a bounty! — Schmuddi, Jan 30 '17 at 18:27
Twas a pleasure, and well worth it. Still, a rather hard choice, but for me it was @josay's clear explanation of how to adjust your original code that won — Jongware, Jan 30 '17 at 18:33

score 6 · Accepted Answer · edited May 23 '17 at 11:53

One alternative way to approach the problem is to use pyparsing and this example regex parser that would expand a regular expression to possible matching strings. For your x y{1,2} z sample string it would generate two possible strings expanding the quantifier:

$ python -i regex_invert.py 
>>> s = "x y{1,2} z"
>>> for item in invert(s):
...     print(item)
... 
x y z
x yy z

The repetition itself supports both an open-ended range and a closed range and is defined as:

repetition = (
    (lbrace + Word(nums).setResultsName("count") + rbrace) |
    (lbrace + Word(nums).setResultsName("minCount") + "," + Word(nums).setResultsName("maxCount") + rbrace) |
    oneOf(list("*+?"))
)

To get to the desired result, we should modify the way the results are yielded from the recurseList generator and return lists instead of strings:

for s in elist[0].makeGenerator()():
    for s2 in recurseList(elist[1:]):
        yield [s] + [s2]  # instead of yield s + s2

Then, we need to only flatten the result:

$ ipython3 -i regex_invert.py 

In [1]: import collections

In [2]: def flatten(l):
   ...:     for el in l:
   ...:         if isinstance(el, collections.Iterable) and not isinstance(el, (str, bytes)):
   ...:             yield from flatten(el)
   ...:         else:
   ...:             yield el
   ...:             

In [3]: s = "x y{1,2} z"

In [4]: for option in invert(s):
   ...:     print(list(flatten(option)))
   ...: 
['x', ' ', 'y', None, ' ', 'z']
['x', ' ', 'y', 'y', ' ', 'z']

Then, if needed, you can filter the whitespace characters:

In [5]: for option in invert(s):
   ...:     print([item for item in flatten(option) if item != ' '])
   ...:     
['x', 'y', None, 'z']
['x', 'y', 'y', 'z']

I wasn't aware of ``pyparsing``, and it is an interesting module that I should keep an eye on. However, as specified in the question, the output has to be lists of equal lengths, padded by ``None`` where appropriate. Can ``pyparsing`` do that? — Schmuddi, Jan 25 '17 at 14:39
@Schmuddi yup, we can get to the desired output, I'll update the answer. — alecxe, Jan 25 '17 at 15:17
@Schmuddi sorry for the delay, please see the updated answer. Hope this works for you. — alecxe, Jan 28 '17 at 15:31

Marat · Answer 2 · 2017-01-29T14:06:15.950

Recursive solution (simple, good for up to few thousand tuples):

def permutations(lot):
    if not lot:
        yield []
    else:
        item, start, end = lot[0]
        for prefix_length in range(start, end+1):
            for perm in permutations(lot[1:]):
                yield [item]*prefix_length + [None] * (end - prefix_length) + perm

It is limited by the recursion depth (~1000). If it is not enough, there is a simple optimization for start == end cases. Dependin on the expected size of list_of_tuples it might be enough

Test:

>>> list(permutations(list_of_tuples))  # list() because it's an iterator
[['x', 'y', None, 'z'], ['x', 'y', 'y', 'z']]

Without recursion (universal but less elegant):

def permutations(lot):
    source = []
    cnum = 1  # number of possible combinations
    for item, start, end in lot:  # create full list without Nones
        source += [item] * (end-start+1)
        cnum *= (end-start+1)

    for i in range(cnum):
        bitmask = [True] * len(source)
        state = i
        pos = 0
        for _, start, end in lot:
            state, m = divmod(state, end-start+1)  # m - number of Nones to insert
            pos += end-start+1
            bitmask[pos-m:pos] = [None] * m
        yield [bitmask[i] and c for i, c in enumerate(source)]

The idea behind this solution: actually, we are kind of looking full string (xyyz) though a glass wich adds certain number of None. We can count numer of possible combinations by calculating product of all (end-start+1). Then, we can just number all iterations (simple range loop) and reconstruct this mask from the iteration number. Here we reconstruct the mask by iteratively using divmod on the state number and using remainder as the number of Nones at the symbol position

score 2 · Answer 3 · answered Jan 24 '17 at 12:56

The part generating the different lists based on the tuple can be written using list comprehension:

outer = []
for val, start, end in lot:
    # For each tuple, create a list of constant length
    # Each element contains a different number of 
    # repetitions of the value of the tuple, padded
    # by the value None if needed.
    outer.append([[val] * i + [None] * (end - i) for i in range(start, end + 1)])

(The whole thing would be again be written with list comprehension but it makes the code harder to read IMHO).

On the other hand, the list comprehension in [x for x in itertools.chain.from_iterable(combination)] could be written in a more concise way. Indeed, the whole point is to build an actual list out of an iterable. This could be done with : list(itertools.chain.from_iterable(combination)). An aternative would be to use the sum builtin. I am not sure which is better.

Finally, the final.append part could be written with a list comprehension.

# use itertools.product to combine the elements in the list of lists:
# flatten the elements in the current combination,
return [sum(combination, []) for combination in itertools.product(*outer)]

The final code is just based on the code you've written slightly re-organised:

outer = []
for val, start, end in lot:
    # For each tuple, create a list of constant length
    # Each element contains a different number of 
    # repetitions of the value of the tuple, padded
    # by the value None if needed.
    outer.append([[val] * i + [None] * (end - i) for i in range(start, end + 1)])

# use itertools.product to combine the elements in the list of lists:
# flatten the elements in the current combination,
return [sum(combination, []) for combination in itertools.product(*outer)]

More elegant way to implement regexp-like quantifiers

3 Answers3