1

I have a (big) array with data and I would like to sort it with respect to given intervals. My code is very slow and I was wondering if there is a python function that could help me. I am aware that there is a sort-function and it can be used to keep the indices

sorted(range(len(vals)), key=vals.__getitem__)

but that doesn't seem too helpful with the intervals.

Here is my minimal example:

intervals = np.arange(0, 1, 0.1)
data = np.random.rand(5, 5)
output = [[] for _ in range(len(intervals))]

for i in range(np.shape(data)[0]):
    for j in range(np.shape(data)[0]):
        for k in range(len(intervals)-1):
            if (data[i,j] >= intervals[k]) & (data[i,j] > intervals[k+1]):
                output[i].append(data[i,j])

Thanks for any help!

Edit: I created some more realistic data that I use:

intervals = np.arange(0, 5000, 500)
data = [[1091, 781, 675, 3644, 677], 
        [91, 751, 2675, 644, 75], 
        [3791, 4756, 3675, 644, 3669], 
        [791, 1754, 1679, 4643, 2675], 
        [2991, 2751, 675, 2629, 2659]]

output = [[] for _ in range(len(intervals))]


for i in range(np.shape(data)[0]):
    for j in range(np.shape(data)[1]):
        for k in range(len(intervals)-1):
            if (data[i][j] >= intervals[k]) & (data[i][j] < intervals[k+1]):
                output[k].append(data[i][j])

# output = [[91, 75], [781, 675, 677, 751, 644, 644, 791, 675], [1091], [1754, 1679], [], [2675, 2675, 2991, 2751, 2629, 2659], [], [3644, 3791, 3675, 3669], [], []]
hwerner
  • 49
  • 4
  • 1
    The initialization of `output` is incorrect. Please refer to [List of lists changes reflected across sublists unexpectedly](https://stackoverflow.com/questions/240178/list-of-lists-changes-reflected-across-sublists-unexpectedly). BTW, can you provide a small input and expected output? – Mechanic Pig Sep 14 '22 at 09:16
  • Do you mean `output = [[1]*1 for _ in range(len(intervals))]` ? – hwerner Sep 14 '22 at 09:59
  • `[[] for _ in range(len(intervals))]`, this is the correct way to create nested lists. – Mechanic Pig Sep 14 '22 at 10:05
  • I ran your code and got this output: – Stef Sep 14 '22 at 12:05
  • `[[1091, 1091, 781, 675, 3644, 3644, 3644, 3644, 3644, 3644, 3644, 677], [751, 2675, 2675, 2675, 2675, 2675, 644], [3791, 3791, 3791, 3791, 3791, 3791, 3791, 4756, 4756, 4756, 4756, 4756, 4756, 4756, 4756, 4756, 3675, 3675, 3675, 3675, 3675, 3675, 3675, 644, 3669, 3669, 3669, 3669, 3669, 3669, 3669], [791, 1754, 1754, 1754, 1679, 1679, 1679, 4643, 4643, 4643, 4643, 4643, 4643, 4643, 4643, 4643, 2675, 2675, 2675, 2675, 2675], [2991, 2991, 2991, 2991, 2991, 2751, 2751, 2751, 2751, 2751, 675, 2629, 2629, 2629, 2629, 2629, 2659, 2659, 2659, 2659, 2659], [], [], [], [], []]` – Stef Sep 14 '22 at 12:05
  • Is this really the output that you want? Could you explain the logic? – Stef Sep 14 '22 at 12:06
  • I'm guessing `if (data[i][j] >= intervals[k]) & (data[i][j] > intervals[k+1]):` was meant to be `if (data[i][j] >= intervals[k]) & (data[i][j] < intervals[k+1]):`; you used `>` instead of `<` – Stef Sep 14 '22 at 12:06
  • I'm also guessing that `output[i].append(data[i][j])` was supposed to be `output[k].append(data[i][j])`; you wrote `i` instead of `k` – Stef Sep 14 '22 at 12:09
  • It's not clear to me what king of output you want. Here is one possibility: `output = {}; for row in data: for x in row: output.setdefault(x // 500, []).append(x)`. This results in `output = {2: [1091], 1: [781, 675, 677, 751, 644, 644, 791, 675], 7: [3644, 3791, 3675, 3669], 0: [91, 75], 5: [2675, 2675, 2991, 2751, 2629, 2659], 9: [4756, 4643], 3: [1754, 1679]}` – Stef Sep 14 '22 at 12:15
  • Also note that your `for j in range(np.shape(data)[0])` should probably be `for j in range(np.shape(data)[1])` – Stef Sep 14 '22 at 12:16
  • Sorry for the confusion, I edited the example. – hwerner Sep 14 '22 at 13:52

1 Answers1

1

Assuming the intervals all have the same length 500, then you can find which interval a number x belongs to with a simple division.

If the intervals are [start, start+500), [start+500, start+1000), ... then number x belongs to interval number (x - start) // 500.

list of lists

intervals = np.arange(0, 5000, 500)
data = [[1091, 781, 675, 3644, 677], 
        [91, 751, 2675, 644, 75], 
        [3791, 4756, 3675, 644, 3669], 
        [791, 1754, 1679, 4643, 2675], 
        [2991, 2751, 675, 2629, 2659]]

start, end = intervals[0], intervals[-1]
length = intervals[1] - interval[0]

output = [[] for _ in range(len(intervals) - 1)]
out_of_bounds = []
for row in data:
    for x in row:
        if start <= x < end:
            output[x // 500].append(x)
        else:
            out_of_bounds.append(x)

print(output)
# [[91, 75], [781, 675, 677, 751, 644, 644, 791, 675], [1091], [1754, 1679], [], [2675, 2675, 2991, 2751, 2629, 2659], [], [3644, 3791, 3675, 3669], []]

print(out_of_bounds)
# [4756, 4643]

Note how np.arange(0, 5000, 500) is array([ 0, 500, 1000, 1500, 2000, 2500, 3000, 3500, 4000, 4500]) and doesn't have value 5000. If you want to include 5000, then choose intervals = np.arange(0, 5000+1, 500) instead.

dict of lists

If you don't know the bounds or the number of intervals in advance, but only their length, then use a dictionary rather than a list:

length = 500
output = {}
for row in data:
    for x in row:
        output.setdefault((x // length) * length, []).append(x)

print(output)
# {1000: [1091], 500: [781, 675, 677, 751, 644, 644, 791, 675], 3500: [3644, 3791, 3675, 3669], 0: [91, 75], 2500: [2675, 2675, 2991, 2751, 2629, 2659], 4500: [4756, 4643], 1500: [1754, 1679]}
Stef
  • 13,242
  • 2
  • 17
  • 28