For simplicity:
lst = [[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]]
intervals = [0,3,5,8] #usually, variables starts lowercase
Initial version (not an answer yet)
I'll demonstrate how to split list into several groups by indices from intervals
and then return maximum items of each group here. You can use a trick which I would like to call 'shift` of array:
def get_groups(lst, intervals):
return [lst[i:j] for i,j in zip(intervals[:-1], intervals[1:])]
This is a nice way to construct tuples of slices that are: (0, 3)
, (3, 5)
, (5, 8)
. Now you have:
>>> groups = get_groups(lst, interval)
>>> groups
[[[1, 2], [1, 4], [1, 6]],
[[2, 6], [2, 3]],
[[3, 5], [7, 8]]]
And then you extract maximum elements when sorting by second column:
>>> [max(n, key = lambda x: x[1]) for n in groups]
[[1, 6], [2, 6], [7, 8]]
If it's important to distinguish between two items that has the same values of second column:
[max(n, key = lambda x: (x[1], x[0])) for n in groups]
Final version
OP required, in contrast, to split list into several groups by values that falls into intervals
. It's possible to build an algorithm on top of first result if list is presorted and we are doing a single search of array in order to find indices where elements should be inserted to maintain order. In that case get_groups
should be redefined as follows:
def get_groups(lst, intervals):
lst = sorted(lst)
firstcolumn = [n[0] for n in lst]
intervals = searchsorted(first_column, intervals)
return [lst[i:j] for i,j in zip(intervals[:-1], intervals[1:])]
At the moment you can also use adapted version of RichieV's answer:
def searchsorted(array, intervals):
idx, i, n = [], 0, len(array)
for upper in intervals:
while array[i] < upper:
i += 1
if i == n:
idx.append(n)
return idx
else:
idx.append(i)
return idx
>>> searchsorted([1,1,1,2,2,3,7], [0,3,5,8])
[0, 5, 6, 7]
Note that get_groups
is not quite optimal because both first_column
and lst
is being iterated twice.
Usage:
def simple_sol(lst, intervals):
return [max(n, key=lambda x: x[1]) for n in get_groups(lst, intervals)]
#Output: [[1, 6], [3, 5], [7, 8]]
Further optimisations
I've wrote a definition of searchsorted inspired by alternative method np.searchsorted
which is based on binary search instead. It's also more efficient ( O(m log(n))
vs O(mn)
). For Python version see also docs and source code of bisect.bisect_left
and related answer about binary search. This is double win, C-level + binary search (pretty much the same as my previous answer):
def binsorted(lst, intervals):
lst = np.array(lst)
lst = lst[np.argsort(lst[:,0])] #sorting lst by first row
idx = np.searchsorted(lst[:,0], intervals)
if idx[-1] == len(lst):
return np.maximum.reduceat(lst, idx[:-1], axis=0)
else:
return np.maximum.reduceat(lst, idx, axis=0)
#Output: [[2, 6], [3, 5], [7, 8]]
Benchmarking
I compared option1
, option2
, option3
, simple_sol
and binsorting
for the samples:
lst = np.random.randint(1000, size = (1000000, 2)).tolist()
intervals = np.unique(np.random.randint(1000, size = 100)).tolist() + [1000]
and timeit
s were:
18.4 s ± 472 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.21 s ± 386 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.3 s ± 410 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.12 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.38 s ± 97.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)