
I have two lists: l1 = [0, 0.002, 0.3, 0.5, 0.6, 0.9, 1.3, 1.9] and l2 = [0.5, 1.0, 1.5, 2.0]. I want to split l1 into sublists whose elements fall between consecutive values of l2. For this example the result should be [[0, 0.002, 0.3], [0.5, 0.6, 0.9], [1.3], [1.9]].

Here is my solution:

l3 = []
b = 0
for i in l2:
    temp = []
    for p in l1:
        if b <= p < i:
            temp.append(p)
    l3.append(temp)
    b += 0.5

This solution is a huge bottleneck in my code. Is there a faster way to do this?

Martijn Pieters
jaydh

4 Answers


Your lists are sorted, so there is no need to do a double loop here.

The following generates the sublists based on the two lists as inputs:

def partition(values, indices):
    idx = 0
    for index in indices:
        sublist = []
        while idx < len(values) and values[idx] < index:
            sublist.append(values[idx])
            idx += 1
        if sublist:
            yield sublist

You can then iterate over partition(l1, l2) to get individual sublists, or call list() to produce the whole list-of-lists in one go:

>>> l1 = [0, 0.002, 0.3, 0.5, 0.6, 0.9, 1.3, 1.9] 
>>> l2 = [0.5, 1.0, 1.5, 2.0]
>>> list(partition(l1, l2))
[[0, 0.002, 0.3], [0.5, 0.6, 0.9], [1.3], [1.9]]
Martijn Pieters
  • wouldn't this be `O(n*m)` like op's solution? would there be huge performance gain? – taesu Sep 16 '15 at 20:56
  • @taesu: the OP's is O(N*M). – Martijn Pieters Sep 16 '15 at 20:57
  • ahhh, you're right. thanks for that, and also thanks for your participation at SO. – taesu Sep 16 '15 at 20:57
  • @taesu, `index` goes through whole list, giving O(n). Now, while iterating over `indices`, `idx` is not decreasing and each iteration of the `while` loop increments `idx` by 1. This gives a maximum of `len (values)` iterations, bumping complexity to `O (n + m)`. In short, both lists are traversed once. – Maciej Gol Sep 16 '15 at 21:02

For a fast approach you can use numpy, which is usually the most efficient option for huge lists:

>>> np.split(l1,np.searchsorted(l1,l2))
[array([ 0.   ,  0.002,  0.3  ]), array([ 0.5,  0.6,  0.9]), array([ 1.3]), array([ 1.9]), array([], dtype=float64)]

np.searchsorted finds the indices at which the items of l2 would have to be inserted into l1 to keep it sorted (l1 is already sorted here), and np.split then splits the list at those indices.
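For illustration, here is roughly what those insertion points look like on the example data (a quick interactive sketch; it assumes numpy is imported as np):

>>> import numpy as np
>>> l1 = [0, 0.002, 0.3, 0.5, 0.6, 0.9, 1.3, 1.9]
>>> l2 = [0.5, 1.0, 1.5, 2.0]
>>> np.searchsorted(l1, l2)   # where each l2 value would be inserted into l1
array([3, 6, 7, 8])

np.split(l1, [3, 6, 7, 8]) then cuts l1 just before indices 3, 6, 7 and 8, which gives the output above; the trailing empty array appears because the last insertion point equals len(l1).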

A benchmark against the accepted answer, on a list 1000 times bigger:

from timeit import timeit

s1="""

def partition(values, indices):
    idx = 0
    for index in indices:
        sublist = []
        while idx < len(values) and values[idx] < index:
            sublist.append(values[idx])
            idx += 1
        if sublist:
            yield sublist

l1 = [0, 0.002, 0.3, 0.5, 0.6, 0.9, 1.3, 1.9]*1000
l2 = [0.5, 1.0, 1.5, 2.0]
list(partition(l1, l2))

"""

s2="""
l1 = [0, 0.002, 0.3, 0.5, 0.6, 0.9, 1.3, 1.9]*1000
l2 = [0.5, 1.0, 1.5, 2.0]
np.split(l1,np.searchsorted(l1,l2))
   """

print '1st: ', timeit(stmt=s1, number=10000)
print '2nd : ', timeit(stmt=s2, number=10000, setup="import numpy as np")

Result :

1st:  17.5872459412
2nd :  10.3306460381
Mazdak
  • You should really move the creation of `l1` and `l2` *out* of the timeit tests (as well as the definition of the `partition()` function). – Martijn Pieters Sep 16 '15 at 21:50
  • I get 1.43 vs 1.1 for 1000 repetitions. Not bad for a pure-python implementation. – Martijn Pieters Sep 16 '15 at 21:52
  • Ah, there is a problem. You cannot just multiply `l1` by 1000, because my solution *requires `l1` to be sorted*. Not that it helps my case if you do sort the list properly, as that makes my solution slightly slower as more results are produced. – Martijn Pieters Sep 16 '15 at 21:53
  • Answered my own question, my solution using pure python is faster – Padraic Cunningham Sep 16 '15 at 22:12
  • @MartijnPieters With moving `l1` and `l2` out of the timeit test it doesn't make much difference; numpy still performs significantly better on large data. http://stackoverflow.com/questions/31598677/why-list-comprehension-is-much-faster-than-numpy-for-multiplying-arrays – Mazdak Sep 17 '15 at 06:00
  • @PadraicCunningham Yep as well your answer is elegant but still you need to timeit with large list to find out the power of Numpy ;-) – Mazdak Sep 17 '15 at 06:02
  • @Kasramvd, I timed it with the same size list as your tests ;) – Padraic Cunningham Sep 17 '15 at 09:44
  • @PadraicCunningham I meant very huge! – Mazdak Sep 17 '15 at 09:54
  • @PadraicCunningham 10000000000000000 times larger! :-D – Mazdak Sep 17 '15 at 09:57
def split_l(a, b):
    it = iter(b)
    start, sub = next(it), []
    for ele in a:
        # once ele reaches the current boundary, emit the sublist and advance
        if ele >= start:
            yield sub
            sub, start = [], next(it)
        sub.append(ele)
    yield sub

print(list(split_l(l1,l2)))
[[0, 0.002, 0.3], [0.5, 0.6, 0.9], [1.3], [1.9]]

Using Kasra's input, this beats both the accepted answer and the numpy solution:

In [14]: l1 = [0, 0.002, 0.3, 0.5, 0.6, 0.9, 1.3, 1.9]*1000

In [15]: l1.sort()

In [16]: l2 = [0.5, 1.0, 1.5, 2.0]

In [17]: timeit list(partition(l1,l2))
1000 loops, best of 3: 1.53 ms per loop

In [18]: timeit list(split_l(l1,l2))
1000 loops, best of 3: 703 µs per loop

In [19]: timeit np.split(l1,np.searchsorted(l1,l2))
1000 loops, best of 3: 802 µs per loop

In [20]: list(split_l(l1,l2))  == list(partition(l1,l2))
Out[20]: True

Creating a local reference to append knocks even more off:

def split_l(a, b):
    it = iter(b)
    start, sub = next(it), []
    append = sub.append
    for ele in a:
        if start <= ele:
            yield sub
            start, sub = next(it), []
            append = sub.append
        append(ele)
    yield sub

Runs in roughly 60% of the time of the numpy solution:

In [47]: l1.sort()

In [48]: timeit list(split_l(l1,l2))
1000 loops, best of 3: 498 µs per loop

In [49]: timeit list(partition(l1,l2))
1000 loops, best of 3: 1.73 ms per loop

In [50]: timeit np.split(l1,np.searchsorted(l1,l2))
1000 loops, best of 3: 812 µs per loop
Padraic Cunningham
l1 = [0, 0.002, 0.3, 0.5, 0.6, 0.9, 1.3, 1.9]
l2 = [0.5, 1.0, 1.5, 2.0]

def partition(values, indices):
    temp = []
    p_list = []

    for j in range(len(indices)):
        for i in range(len(values)):
            if indices[j] > values[i]:
                temp.append(values[i])

        p_list.append(temp)

        # values added to the partition are truncated from the list
        values = values[len(temp):]

        temp = []

    print(p_list)

partition(l1, l2)

[[0, 0.002, 0.3], [0.5, 0.6, 0.9], [1.3], [1.9]]

LetzerWille