3

I have a list that contains either 1's or 0's; nothing else. I am interested in finding the 1's and, more specifically, where a run of 1's starts and where that run ends (or in the code below, the "length" of that run of 1's....it can either be the "length" of that run or the ending index position of that run, as I can do math and figure out the length from the start and ending positions). I'm storing the runs of 1's in a hash. Is there a faster way to get what I'm after than what I have? I'm still learning python and the list I am using in real life is much, much larger, so speed is important.

previous = 0
cnt = 0
startLength = {} 
for r in listy: 
    if previous == 0 and r == 1:
        start = cnt
        startLength[start] = 1
    if previous == 1 and r == 1: 
        startLength[start] = 1 + cnt - start 
    previous = r
    cnt += 1

for s,l in startLength.iteritems():
    print "A run of 1's starts at position %s and lasts %s" % (s,l)
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
user_78361084
  • 3,538
  • 22
  • 85
  • 147
  • 2
    It feels weird to propose this as a dupe twice in one day, but possible duplicate of [What's the most pythonic way to identify consecutive duplicates in a list?](http://stackoverflow.com/q/6352425) – jscs Mar 07 '13 at 19:53
  • I didn't see that. My question is a bit different but along those same lines. – user_78361084 Mar 07 '13 at 20:51
  • @Mr. Gaga - if speed is important to you, then using the answer in the linked duplicate specifically the groupby functionality will end up being slower. – Louis Ricci Mar 11 '13 at 18:33

4 Answers4

8

I might use itertools.groupby for this one

lst = [ 1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0]

from itertools import groupby
from operator import itemgetter

for k,v in groupby(enumerate(lst),key=itemgetter(1)):
    if k:
        v = list(v)
        print v[0][0],v[-1][0]

This will print the start and end indices of the groups of 1's

mgilson
  • 300,191
  • 65
  • 633
  • 696
  • thx for helping us python newbs! – user_78361084 Mar 07 '13 at 20:52
  • Since the question is asking about, speed "fastest", this seems like a very inefficient answer. It's clearly the most pythonic answer, but I'd imagine this would run much slower than the OPs implementation. – Louis Ricci Mar 07 '13 at 20:55
  • @LastCoder -- Why would you think that this is inefficient? The purpose of `itertools` is to do things like this efficiently. – mgilson Mar 07 '13 at 20:57
  • 1
    @mgilson - the added overhead of groupby + enumerate + itemgetter. Just because it's less writing doesn't mean it will execute faster. – Louis Ricci Mar 07 '13 at 21:05
  • @LastCoder -- Of course I understand that less lines != faster code. You'd have to `timeit` to see, but all of those functions are highly optimized, so I wouldn't expect the overhead to be significant for a list of a reasonable size (But again, there's no sense really talking about that too much until we've profiled using OP's real-world data). – mgilson Mar 07 '13 at 21:08
  • 1
    @mgilson - I updated my answer with a quick benchmark, http://ideone.com/ydaX9y as I suspected with just 10000 elements your version performs 2-3 times slower. It doesn't matter how efficient the library code is it still has to iterate through the list and perform excessive operations. Again if you want something fast it looks different than something simple/elegant. But since even the OP approved of your answer I can only assume he didn't really want something fast, just something python-ish. – Louis Ricci Mar 08 '13 at 00:13
2

Apart from @mgilson's pythonic answer you can also try something like this:

lst = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1]

start, end = False, False

for i, x in enumerate(lst):
    if x == 1 and start is False:
        start = i
    if x == 0 and start is not False and end is False:
        end = i-1
    if start is not False and end is not False:
        print start, end  # and len is (end-start+1)
        start, end = False, False

if start is not False:
    print start, i

output:

0 4
12 15
22 23
Ashwini Chaudhary
  • 244,495
  • 58
  • 464
  • 504
  • this won't catch a final run of 1s in the sequence. you need to add if start is not False and end is not False: print start, end at the end. – Max Nov 20 '15 at 16:20
  • Sorry, ignore my previous comment, you need to add the line: this won't catch a final run of 1s in the sequence. you need to add if start is not False: print start, i at the end. – Max Nov 20 '15 at 16:32
1

Here's a slightly more efficient solution (sorry it's JavaScript). The key is to only store the length once, in your code you are making a calculation every time the length is increased by one "startLength[start] = 1 + cnt - start".

By using the condition "if previous == 0 and r == 1" instead of your "if previous == 1 and r == 1". I reduce the amount of calculations, but I also have to add a "if r == 1" after the for loop to catch the final case.

var test=[0,1,1,1,0,0,0,1,1,0,0,1,0];
function runs(arr) {
    var result = {};
    var start = 0;
    var previous = 0;
    var cnt = 0;
    var r = 0;
    for(; cnt<arr.length; cnt++) {
        var r = arr[cnt];
        if(r == 1 && previous == 0)
            start = cnt;
        if(r == 0 && previous == 1)
            result[start] = cnt - start;
        previous = r;
    }
    if(r == 1)
        result[start] = cnt - start;
    return result;
}
var result = runs(test);
for(var start in result)
    console.log("start " + start + " length " + result[start]);

EDIT 2 Here's a python benchmark showing that using the groupby function (currently the top answer to this question) is substantially slower.

from itertools import groupby
from operator import itemgetter
import random
import time

lst = [ 1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0]

def makeList(size):
    random.seed()
    return [random.randint(0,1) for r in xrange(size)]


def runs1(lst, showOutput):
    startLength = {}
    for k,v in groupby(enumerate(lst),key=itemgetter(1)):
        if k:
            v = list(v)
            startLength[v[0][0]] = v[-1][0] + 1 - v[0][0]
    if showOutput == True:
        for s,l in startLength.iteritems():
            print s,l

def runs2(lst, showOutput):
    previous = 0
    cnt = 0
    startLength = {} 
    for r in lst: 
        if previous == 0 and r == 1:
            start = cnt
        if previous == 1 and r == 0: 
            startLength[start] = cnt - start
        previous = r
        cnt += 1
    if r == 1:
        startLength[start] = cnt - start
    if showOutput == True:
        for s,l in startLength.iteritems():
            print s,l

testList = makeList(10)
print "slow version"
runs1(testList, True)
print "fast version"
runs2(testList, True)

benchmarkList = makeList(10000)

start = time.time()
runs1(benchmarkList, False)
print "slow ", time.time() - start
start = time.time()
runs2(benchmarkList, False)
print "fast ", time.time() - start

start = time.time()
runs1(benchmarkList, False)
print "slow ", time.time() - start
start = time.time()
runs2(benchmarkList, False)
print "fast ", time.time() - start

start = time.time()
runs1(benchmarkList, False)
print "slow ", time.time() - start
start = time.time()
runs2(benchmarkList, False)
print "fast ", time.time() - start
Louis Ricci
  • 20,804
  • 5
  • 48
  • 62
-1

You can use a regex for this, if you convert your list to a string.

import re
import random

int_list = [random.randint(0,1) for i in xrange(1000000)]
runs = re.findall('1+', ''.join(map(str, int_list) # get a list of one-runs

# Print their lengths.
for run in runs:
    print len(run)

# If you really need to know the indexes where the runs are found, instead use
runs = re.finditer('1+', ''.join(map(str, int_list) # get a list of matches
for run in runs:
    print 'Start:\t%s' % run.start()
    print 'End:\t%s' % run.end()
    print 'Length:\t%s' % run.end()-run.start()
Henry Keiter
  • 16,863
  • 7
  • 51
  • 80
  • 2
    Given he starts with a list of integers, he'd first have to convert each element to a string and concatenate them all together. That's not exactly what I'd call "fast". He's also not trying to get a list of all runs without context, he's trying to get each run in context of the overall list. – Silas Ray Mar 07 '13 at 19:48
  • Disagree w/ the downvote. It works. – mechanical_meat Mar 07 '13 at 19:49
  • 4
    It works at doing something the OP didn't ask to do... – Silas Ray Mar 07 '13 at 19:49
  • 2
    @bernie The downvote button doesn't say "this answer does not work", it says "this answer is not useful". –  Mar 07 '13 at 19:51
  • Point taken. Thanks to you both for offering your thoughts. – mechanical_meat Mar 07 '13 at 19:52
  • @sr2222 It's reasonably quick to convert a list of ints to a string... The regex above executes in 2.4 seconds for a randomly initiated list of 10,000,000 ints, including the `map` and `join`. – Henry Keiter Mar 07 '13 at 19:55
  • 1
    Yet still slower than not doing it at all, which is better, since it's not necessary. And besides, you still aren't generating the values the OP asked for; he wants the start indices (within the original sequence) and lengths for each run, not just a list of the runs themselves, so you've only gotten part way to the goal. – Silas Ray Mar 07 '13 at 19:58