0

I'm looking for an elegant 'Pythonic' way to loop through input in groups of a certain size. But the size of the groups vary and you don't know the size of a particular group until you start parsing.

Actually I only care about groups of two sizes. I have a large sequence of input which come in groups of mostly size X but occasionally size Y. For this example lets say X=3 and Y=4 (my actual problem is X=7 and Y=8). I don't know until mid-way through the group of elements if it's going to be a group of size X or of size Y elements.

In my problem I'm dealing with lines of input but to simplify I'll illustrate with characters.

So if it's a group of a particular size I know I'll always get input in the same sequence. So for example if it's a size X group I'll be getting elements of type [a,a,b] but if it's a size Y group I'll be getting elements of type [a,a,c,b]. f it's something of type 'a' I'll want to process it in a certain way and 'b' another etc.

Obviously I have to test an element at some point to determine if it's of type one group or the other. As demonstrated above I cannot just check the type of every element because there may be two of the same in sequence. Also the groups may be the same pattern at the start and only differ near the end. In this example the earliest I can check if I'm in a size X or size Y group is by testing the 3rd element (to see if it's of type 'c' or type 'b').

I have a solution with a for loop with an exposed index and two extra variables, but I'm wondering if there is a more elegant solution.

Here is my code. I've put pass statements in place of where I would do the actual parsing depending on what type it is:

counter = 0
group = 3
for index, x in enumerate("aabaabaacbaabaacbaabaab"):
    column = index - counter;
    print(str(index) + ", " + x + ", " + str(column))

    if column == 0:
        pass
    elif column == 1:
        pass
    elif column == 2:
        if x == 'c':
            pass
        elif x == 'd':
            group = 4
    elif column == 3:
        pass

    if column + 1 == group:
        counter += group 
        group = 3

In the code example the input stream is aabaabaacbaabaacbaabaab so that is groups of:

  1. aab (3)
  2. aab (3)
  3. aacb (4)
  4. aab (3)
  5. aacb (4)
  6. aab (3)
  7. aab (3)
freshnewpage
  • 413
  • 1
  • 3
  • 10
  • Could you a) give a shorter, to the point description of the problem and b) provide a [MCVE](http://stackoverflow.com/help/mcve)? I tried to read your question three times and could not decipher what the question is. – timgeb Jan 23 '16 at 13:12
  • Basically I'm consuming a sequence of input that I know is two kinds of groups. As I go through the input, part way through I'll be able to work out if I'm in one kind of group or the other. Whilst I have something that works in my current solution I feel like the extra variables are something that would go in a C / Java style for loop with case statements but it feels clunky in Python. – freshnewpage Jan 23 '16 at 13:27

3 Answers3

4

I would use a generator that collect these groups and determines the size for each, and then ultimately yields each group:

def getGroups (iterable):
    group = []
    for item in iterable:
        group.append(item)
        if len(group) == 3 and group[2] == 'c':
            yield group
            group = []
        elif len(group) == 4 and group[2] == 'd':
            yield group
            group = []

for group in getGroups('abcabcabdcabcabdcabcabc'):
    print(group)
['a', 'b', 'c']
['a', 'b', 'c']
['a', 'b', 'd', 'c']
['a', 'b', 'c']
['a', 'b', 'd', 'c']
['a', 'b', 'c']
['a', 'b', 'c']
poke
  • 369,085
  • 72
  • 557
  • 602
2

Looks like you need a simple automata with backtracking, for example:

def parse(tokens, patterns):

    backtrack = False
    i = 0

    while tokens:

        head, tail = tokens[:i+1], tokens[i+1:]
        candidates = [p for p in patterns if p.startswith(head)]
        match = any(p == head for p in candidates)

        if match and (backtrack or len(candidates) == 1 or not tail):
            yield head
            tokens = tail
            backtrack = False
            i = 0

        elif not candidates:
            if not i or backtrack:
                raise SyntaxError, head
            else:
                backtrack = True
                i -= 1

        elif tail:
            i += 1

        else:
            raise SyntaxError, head

tokens = 'aabaabcaabaabcxaabxxyzaabx'
patterns = ['aab', 'aabc', 'aabcx', 'x', 'xyz']

for p in parse(tokens, patterns):
    print p
georg
  • 211,518
  • 52
  • 313
  • 390
0

With your particular example, you could use a regex:

>>> s="aabaabaacbaabaacbaabaab"
>>> re.findall(r'aac?b', s)
['aab', 'aab', 'aacb', 'aab', 'aacb', 'aab', 'aab']

If you want a parser that does the same thing, you can do:

def parser(string, patterns):
    patterns=sorted(patterns, key=len, reverse=True)
    i=0
    error=False
    while i<len(string) and not error:
        for pat in patterns:
            j=len(pat)
            if string[i:i+j]==pat:
                i+=j
                yield pat
                break
        else:
            error=True

    if error or i<len(string):
        raise SyntaxWarning, "Did not match the entire string"

>>> list(parser(s, ['aab', 'aacb']))
['aab', 'aab', 'aacb', 'aab', 'aacb', 'aab', 'aab']
dawg
  • 98,345
  • 23
  • 131
  • 206
  • Sure, this works for the example, but as I mentioned, this string is only illustrative; each character represents a line of input. The real input is a large sequence of lines. – freshnewpage Jan 23 '16 at 13:34
  • But you state you a dealing with this input line by line. That is a very typical regex application. Even if your data is 7 but sometimes 8 and just varies in some optional Nth element -- a regex is a robust way to discriminate. – dawg Jan 23 '16 at 18:01
  • Perhaps my example has confused things. In my example the string is meant to represent a list of lines. A character in the string represents a whole line. I only did it this way to be concise. As far as I'm aware you can't regex over groups of lines at a time. A more accurate example would be ['alpha', 'alpha', 'beta', 'alpha', 'alpha', 'gamma', 'beta', 'alpha', 'alpha', 'beta'] – freshnewpage Jan 23 '16 at 22:41
  • You can use a regex for that pattern as well. [Demo](https://regex101.com/r/yW6lB1/1) You can use a [mmap](http://stackoverflow.com/a/454589/298607) to use a regex on a large file... – dawg Jan 23 '16 at 23:26