0

I needed a python function which would take a list of strings in the form of:

seq = ['A[0]','B[2:5]','A[4]']

and return a new list of "expanded" elements with preserved order, like so:

expanded = ['A[0]', 'B[2]', 'B[3]', 'B[4]', 'B[5]', 'A[4]']

To achieve my goal I wrote this simple function:

def expand_seq(seq):
    #['item[i]' for item in seq for xrange in item]
    return ['%s[%s]'%(item.split('[')[0],i) for item in seq for i in xrange(int(item.split('[')[-1][:-1].split(':')[0]),int(item.split('[')[-1][:-1].split(':')[-1])+1)]

When dealing with a sequence which would generate less than 500k items it works well, but it slows down quite a bit when generating very large lists (more 1 million). For example:

# let's generate 10 million items!
seq = ['A[1:5000000]','B[1:5000000]']
t1 = time.clock()
seq = expand_seq(seq)
t2 = time.clock()
print round(t2-t1, 3)
# RESULT: 9.541 seconds

I'm looking for ways to improve this function and hopefully speed it up when dealing with large lists. If anyone has suggestions, I would love to hear them!

Fnord
  • 5,365
  • 4
  • 31
  • 48
  • Sounds like a job for generators. – jedwards Mar 17 '15 at 00:40
  • 1
    Creating 10,000,000 elements in a list by iterating `xrange` will be expected to be slow. What are you going to do with the generated data? – thefourtheye Mar 17 '15 at 00:41
  • The proper optimization will be largely depending on what you want to do with your list afterwards. If you can/want to process generated elements one by one, a generator would be the way to go. If you want to do something else, then it could be more complicated. – Cilyan Mar 17 '15 at 00:41

2 Answers2

2

The following seems to give a 35% speedup:

import re

r = re.compile(r"(\w+)\[(\d+)(?::(\d+))?\]")

def expand_seq(seq):
    result = []
    for item in seq:
        m = r.match(item)
        name, start, end = m.group(1), int(m.group(2)), m.group(3)
        rng = xrange(start, int(end)) if end else (start,)
        t = name + "["
        result.extend(t + str(i) + "]" for i in rng)
    return result

With this code:

  • We compile a regular expression for use in the function.
  • We concatenate our strings directly.
JuniorCompressor
  • 19,631
  • 4
  • 30
  • 57
  • This solution is promising! I'm bad at using regex and i really should try harder. The only issue with your solution is that it fails to deal with items with no "range"... ex: A[0].. also an item like A[0:2] should expand to A[0],A[1],A[2], as is it stops at A[1]... Changing int(end) to int(end)+1 will fix – Fnord Mar 17 '15 at 04:01
  • actually scratch that last comment about not handling "no range" items, the error was due to testing with a list that had non-ascii characters (in this case a dot: 'A.p[0]' and that broke the reggex – Fnord Mar 17 '15 at 04:34
0

I'm not sure you'll get a dramatic speedup since you can't fundamentally improve the algorithm. I did get about a 20% speedup from yours by doing it this way:

def expand_seq(seq):
    expanded = []
    for s in seq:
        name, indices = s[0:-1].split("[")
        if ":" in indices:
            index1, index2 = [int(i) for i in indices.split(":")]
        else:
            index1 = int(indices)
            index2 = index1
        for n in range(index1, index2 + 1):
            expanded.append("{}[{}]".format(name, n))
    return expanded

I think the speedup is largely due to not repeating some operations (like int and split) which you had to do to keep your solution to a one-liner.

As has been suggested though, if you use a generator you can start consuming the results instantly. Like this:

def expand_seq(seq):
    for s in seq:
        name, indices = s[0:-1].split("[")
        if ":" in indices:
            index1, index2 = [int(i) for i in indices.split(":")]
        else:
            index1 = int(indices)
            index2 = index1
        for n in range(index1, index2 + 1):
            yield "{}[{}]".format(name, n)
Andrew Magee
  • 6,506
  • 4
  • 35
  • 58
  • Thanks for the suggestion! After doing some tests i decided to implement JuniorCompressor's solution as it was the fastest! – Fnord Mar 17 '15 at 04:53