How to sort a list of lists and and to keep only the maximal 2nd element of each of the 1st elements by intervals?

Question

This is an harder version of this question but I couldn't solve it efficiently (preferably without the need to import libraries).

Let's say that I have some list:

lst = [[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]]

And let's say that I have a list of intervals:

intervals = [0,3,5,8]

I want to keep in each interval one sublist by 1st element and the one that have the highest 2nd element. In this example it means that there will be only one sublist which the 1st element is between 0 & 3, one sublist which the 1st element is between 3 & 5, etc... so the result will be:

result:
>>> [[1,6],[3,5],[7,8]]

To be noted:

It is not very important if it will be in such way as {0 =< x < 3} or {0 < x =< 3} as long as there are no duplicates.
It is better that if we have, for example, [1,6] and [2,6] in the same interval that the one that will be kept is the one with the lowest 1st element ( [1,6] )

You're maybe looking for Codegolf. SO is not for coding quizes. — Banana, Sep 06 '20 at 17:22
@Justlearnedit TBH this is the first time the I've heard of Codegolf and needed to google it. But no, I'm trying to make an approximation algorithm and this interval process is in its basis but I couldn't find an efficient solution for it. Nevertheless, if this question is inappropriate for SO I'll remove it of course. — Guy, Sep 06 '20 at 17:28
@deadshot I want to keep only one sublist by the 1st element so that there will be one representor for each interval — Guy, Sep 06 '20 at 17:30
@Guy it's still not clear how `0` is realted to one of the list — deadshot, Sep 06 '20 at 17:34
@deadshot in this example the intervals list [0,3,5,8] means that there will be only one sublist which the 1st element is between 0 & 3, one sublist which the 1st element is between 3 & 5, etc... Is it clearer? I'll add it to the question itself. — Guy, Sep 06 '20 at 17:37
@deadshot It is not very important if it will be in such way as {0 =< x < 3} or {0 < x =< 3} as long as there are no duplicates between intervals — Guy, Sep 06 '20 at 17:42
expected output for this `[[1,2],[1,4],[1,6],[2,6],[4,6],[3,5],[4, 5]]`? — deadshot, Sep 06 '20 at 17:44
@deadshot It's in the question, Result: >>> [[1,6],[3,5],[7,8]] — Guy, Sep 06 '20 at 17:44
@deadshot Ok sorry, [[1,6],[4,6]] in this case the sublist with 3 cannot be part of any interval as in both there are sublist with 2nd element of 6(>5). — Guy, Sep 06 '20 at 17:48
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221055/discussion-between-guy-and-deadshot). — Guy, Sep 06 '20 at 17:51

RichieV · Accepted Answer · 2020-09-07T00:03:52.720

Here are three solutions, ordered by performance:

Create two lists for first/second number in each element. It increases memory usage but is the fastest option.
Use key parameter in max to get the element with highest second number. Avoids duplicating memory usage, but is about 30% slower. This could be a good middle ground.
Use itertools.groupby with a key function that gets the interval of the first number in each element. It can be used for more robust applications, but is not as efficient as for each element it iterates Intervals until it finds the matching interval. It is almost 3x slower than the first option.

Option 1: create two lists

Separating the list into two lists for first/second number of each element.

# sort and separate lst
lst = sorted(lst)
first = [e[0] for e in lst]
second = [e[1] for e in lst]

# iterate upper limits of intervals and get max of each sublist
i = k = 0
keep = []
while lst[i][0] < Intervals[0]:
    i += 1
for upper in Intervals[1:]:
    k = sum(f < upper for f in first[i:])
    keep.append(i + second[i:i+k].index(max(second[i:i+k])))
    i += k

result = [lst[i] for i in keep]
print(result)

Output

[[1, 6], [3, 5], [7, 8]]

Option 2: use max(lst, key)

You can get the element with the maximum second number with max(lst, key=lambda x: x[1]). Here is the implementation for the intervals.

lst = sorted(lst)

i = k = 0
result = []
for upper in Intervals:
    i += k
    # old solution summed a generator
    # k = sum(e[0] < upper for e in lst[i:])
    # this one uses a while-loop to avoid checking the rest of the list on each iteration
    # a good idea if `lst` is long and `Intervals` are many
    k = 0
    while i + k < len(lst) and lst[i+k][0] < upper: 
        k += 1
    if upper == Intervals[0]:
        continue
    result.append(max(lst[i:i+k], key=lambda x:x[1]))

Output

[[1, 6], [3, 5], [7, 8]]

Option 3: itertools.groubpy(lst, key)

from itertools import groupby

def get_bin(element, bins):
    x = element[0]
    if x < bins[0]:
        return -1
    elif x in bins:
        return bins.index(x)
    else:
        for i, b in enumerate(bins[1:]):
            if x < b:
                break
        return i
        

result = sorted([
    max(items, key=lambda x: x[1])
    for _, items in groupby(lst, lambda x: get_bin(x, Intervals))
])

Output

[[1, 6], [3, 5], [7, 8]]

Does `lambda x: get_bin(x, intervals)` instead `get_bin` help for you? — mathfux, Sep 06 '20 at 23:56
Btw, you can also use `first, second = zip(*lst)` in your option 1. This helps to avoid double iteration. And you might also like to see benchmarking results in my answer. — mathfux, Sep 07 '20 at 00:35
I guess that using that `.index()` method in your option 1 is a huge performance overload. — mathfux, Sep 07 '20 at 00:39
@mathfux thans for the `zip` suggestion, I did not like those two list comps either... as for the timings, using numpy will definitely speed things up, but on pure python solutions I got option 1 as the fastest, perhaps you are artificially helping the following solutions by having already sorted `lst` and further calls to sort will take significantly less time — RichieV, Sep 07 '20 at 01:49
@RichieV I have used sorting inside every of these 5 methods except option3. Sorting itself takes 2.8 seconds for million items. `second[i:i+k].index(max(second[i:i+k]))` is not optimal too (I know np.argmin would be a huge boost) and you need to add more script to remove that inefficient conjunction of index and max. Though these wastes were not essential. The worst thing here is checking condition f < upper for sufficed list which is half of million on average for any of these thousand uppers. This seems to be a reason why option1 is actually the slowest. – — mathfux, Sep 07 '20 at 02:21
Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/221069/discussion-between-mathfux-and-richiev). — mathfux, Sep 07 '20 at 03:00

mathfux · Answer 2 · 2020-09-07T03:31:37.083

For simplicity:

lst = [[1,2],[1,4],[1,6],[2,6],[2,3],[3,5],[7,8]]
intervals = [0,3,5,8] #usually, variables starts lowercase

Initial version (not an answer yet)

I'll demonstrate how to split list into several groups by indices from intervals and then return maximum items of each group here. You can use a trick which I would like to call 'shift` of array:

def get_groups(lst, intervals):
    return [lst[i:j] for i,j in zip(intervals[:-1], intervals[1:])]

This is a nice way to construct tuples of slices that are: (0, 3), (3, 5), (5, 8). Now you have:

>>> groups = get_groups(lst, interval)
>>> groups
[[[1, 2], [1, 4], [1, 6]], 
 [[2, 6], [2, 3]], 
 [[3, 5], [7, 8]]]

And then you extract maximum elements when sorting by second column:

>>> [max(n, key = lambda x: x[1]) for n in groups]
[[1, 6], [2, 6], [7, 8]]

If it's important to distinguish between two items that has the same values of second column:

[max(n, key = lambda x: (x[1], x[0])) for n in groups]

Final version

OP required, in contrast, to split list into several groups by values that falls into intervals. It's possible to build an algorithm on top of first result if list is presorted and we are doing a single search of array in order to find indices where elements should be inserted to maintain order. In that case get_groups should be redefined as follows:

def get_groups(lst, intervals):
    lst = sorted(lst)
    firstcolumn = [n[0] for n in lst]
    intervals = searchsorted(first_column, intervals)
    return [lst[i:j] for i,j in zip(intervals[:-1], intervals[1:])]

At the moment you can also use adapted version of RichieV's answer:

def searchsorted(array, intervals):
    idx, i, n = [], 0, len(array)
    for upper in intervals:
        while array[i] < upper:
            i += 1
            if i == n:
                idx.append(n)
                return idx
        else:
            idx.append(i)
    return idx

>>> searchsorted([1,1,1,2,2,3,7], [0,3,5,8])
[0, 5, 6, 7]

Note that get_groups is not quite optimal because both first_column and lst is being iterated twice.

Usage:

def simple_sol(lst, intervals):
    return [max(n, key=lambda x: x[1]) for n in get_groups(lst, intervals)]
#Output: [[1, 6], [3, 5], [7, 8]]

Further optimisations

I've wrote a definition of searchsorted inspired by alternative method np.searchsorted which is based on binary search instead. It's also more efficient ( O(m log(n)) vs O(mn)). For Python version see also docs and source code of bisect.bisect_left and related answer about binary search. This is double win, C-level + binary search (pretty much the same as my previous answer):

def binsorted(lst, intervals):
    lst = np.array(lst)
    lst = lst[np.argsort(lst[:,0])] #sorting lst by first row
    idx = np.searchsorted(lst[:,0], intervals)
    if idx[-1] == len(lst):
        return np.maximum.reduceat(lst, idx[:-1], axis=0)
    else:
        return np.maximum.reduceat(lst, idx, axis=0)

#Output: [[2, 6], [3, 5], [7, 8]]

Benchmarking

I compared option1, option2, option3, simple_sol and binsorting for the samples:

lst = np.random.randint(1000, size = (1000000, 2)).tolist()
intervals = np.unique(np.random.randint(1000, size = 100)).tolist() + [1000]

and timeits were:

18.4 s ± 472 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.21 s ± 386 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10.3 s ± 410 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
4.12 s ± 202 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
1.38 s ± 97.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Nice trick, but I believe it does not apply.. the goal is to group elements by the interval of the first number (similar to `pandas.cut('column1')`), your code uses the Intervals as indices instead of cut-off values — RichieV, Sep 06 '20 at 18:40
@mathfux Thank you! but it seems like it grouping is not as I meant. See that for intervals = [0,3,5,8] we can expect a result of [[1,6],[3,5],[7,8]] — Guy, Sep 06 '20 at 19:06
@Guy. Okay, seems I get it. I need to extract groups by intervals of `x[0]`, not by intervals of `lst` itself. — mathfux, Sep 06 '20 at 19:11

How to sort a list of lists and and to keep only the maximal 2nd element of each of the 1st elements by intervals?

2 Answers2

Initial version (not an answer yet)

Final version

Further optimisations

Benchmarking