Splitting lists by short numbers

Question

I'm using NumPy to find intersections on a graph, but np.isclose returns multiple values per intersection.

So I'm going to try to find their averages. But first, I want to isolate the similar values, which I feel is a useful skill in its own right.

I have a list of the x values for the intersections, called idx, that looks like this:

[-8.67735471 -8.63727455 -8.59719439 -5.5511022  -5.51102204 -5.47094188
 -5.43086172 -2.4248497  -2.38476954 -2.34468938 -2.30460922  0.74148297
  0.78156313  0.82164329  3.86773547  3.90781563  3.94789579  3.98797595
  7.03406814  7.0741483   7.11422846]

and I want to separate it into lists, each made up of the similar numbers.

This is what I have so far:

n = 0
for i in range(len(idx)):
    try:
        if (idx[n]-idx[n-1])<0.5:
            sdx.append(idx[n-1])
        else:
            print(sdx)
            sdx = []
    except:
        sdx.append(idx[n-1])
    n = n+1

It works for the most part but it forgets some numbers:

[-8.6773547094188377, -8.6372745490981959]
[-5.5511022044088181, -5.5110220440881763, -5.4709418837675354]
[-2.4248496993987976, -2.3847695390781567, -2.3446893787575149]
[0.7414829659318638, 0.78156312625250379]
[3.8677354709418825, 3.9078156312625243, 3.9478957915831661]

There's probably a more efficient way to do this. Does anyone know of one?

What is the end use of this? Are you making a histogram? What determines the groupings -- is it just they are within 0.5 of each other? What would you expect to happen on [-0.5, 0.0, 0.5, 1.0, 1.5]? — wflynny, May 14 '15 at 15:25
Please describe what this is supposed to do. Guessing it from non-functioning code is not an option. And where are all the commas that your idx is missing? — Stefan Pochmann, May 14 '15 at 15:26
Why are you looping with `for i in range(len(idx)):` and then using `n` (which you have to manually increment) for indexing the list? — SuperBiasedMan, May 14 '15 at 15:28
I edited the question to answer some of these, sorry. Also `idx` is a numpy array, that's probably why it has no commas. — user3151828, May 14 '15 at 15:33
You forgot to answer the most important question(s)... nobody knows what you mean with "short" or "similar". — Stefan Pochmann, May 14 '15 at 15:36

Padraic Cunningham · Accepted Answer · 2015-05-14T16:39:40.923

Considering you have a numpy array, you can use np.split, splitting where the difference is > .5:

import numpy as np
x = np.array([-8.67735471, -8.63727455, -8.59719439, -5.5511022, -5.51102204, -5.47094188,
     -5.43086172, -2.4248497, -2.38476954, -2.34468938, -2.30460922, 0.74148297,
     0.78156313, 0.82164329, 3.86773547, 3.90781563, 3.94789579, 3.98797595,
     7.03406814, 7.0741483])


print(np.split(x, np.where(np.diff(x) > .5)[0] + 1))

[array([-8.67735471, -8.63727455, -8.59719439]), array([-5.5511022 , -5.51102204, -5.47094188, -5.43086172]), array([-2.4248497 , -2.38476954, -2.34468938, -2.30460922]), array([ 0.74148297,  0.78156313,  0.82164329]), array([ 3.86773547,  3.90781563,  3.94789579,  3.98797595]), array([ 7.03406814,  7.0741483 ])]

np.where(np.diff(x) > .5)[0] returns every index i where the gap to the next element, x[i+1] - x[i], is greater than .5, i.e. the index of the last element of each group:

In [6]: np.where(np.diff(x) > .5)[0]
Out[6]: array([ 2,  6, 10, 13, 17])

+ 1 adds 1 to each index, so that it points to the first element of the next group:

In [12]: np.where(np.diff(x) > .5)[0] + 1
Out[12]: array([ 3,  7, 11, 14, 18])

Then passing [3, 7, 11, 14, 18] to np.split splits the array at those positions into the subarrays x[:3], x[3:7], x[7:11], ..., x[18:].
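Since the stated goal is to average each cluster, the split can be wrapped in a small helper and followed by a per-group mean. This is only a sketch: the function name split_by_gap and its gap parameter are made up for illustration, and it assumes the input is already sorted in ascending order.

```python
import numpy as np

def split_by_gap(values, gap=0.5):
    """Split sorted 1-D values wherever neighbours differ by more than gap.

    Hypothetical helper for illustration; assumes values is sorted ascending.
    """
    values = np.asarray(values, dtype=float)
    # Index of the first element of every group after the first one
    boundaries = np.where(np.diff(values) > gap)[0] + 1
    return np.split(values, boundaries)

x = np.array([-8.67735471, -8.63727455, -8.59719439, -5.5511022, -5.51102204,
              -5.47094188, -5.43086172, -2.4248497, -2.38476954, -2.34468938,
              -2.30460922, 0.74148297, 0.78156313, 0.82164329, 3.86773547,
              3.90781563, 3.94789579, 3.98797595, 7.03406814, 7.0741483])

groups = split_by_gap(x)
group_sizes = [len(group) for group in groups]
group_means = [group.mean() for group in groups]

print(group_sizes)  # [3, 4, 4, 3, 4, 2]
print(group_means)
```

If the values might arrive unsorted, sorting them first with np.sort keeps the consecutive differences meaningful.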

Divakar · Answer 2 · 2015-05-14T17:27:57.930

If your end goal is to find the average value of each cluster/group, where a group is a run of values whose consecutive differences stay below a certain threshold, you can use the approach listed next.

Basically, we convert the input list to a numpy array, sort it, and then find the consecutive differences. Comparing those differences against a threshold, we create an ID array in which elements of the same group share the same ID. Finally, using those IDs, we bin the values and average within the bins with np.bincount, which gives the average of each group.

Here's the implementation -

import numpy as np

# Input list
AList = [-8.67735471, -8.63727455, -8.59719439, -5.5511022,  -5.51102204,
         -5.47094188, -5.43086172, -2.4248497,  -2.38476954, -2.34468938,
         -2.30460922,  0.74148297,  0.78156313,  0.82164329,  3.86773547,
          3.90781563,  3.94789579,  3.98797595,  7.03406814,  7.0741483,
          7.11422846]

# Tolerance as thresholding parameter to distinguish between two "groups"
tolerance = 1

# Convert to a numpy array and sort if not already sorted
A = np.sort(np.asarray(AList))

# ID array that has the same IDs for elements of the same group
ID_array = (np.append([False],np.diff(A)>tolerance)).cumsum()

# Finally get the average values for each group    
average_values = np.bincount(ID_array,A)/np.bincount(ID_array)

Sample run -

In [301]: A
Out[301]: 
array([-8.67735471, -8.63727455, -8.59719439, -5.5511022 , -5.51102204,
       -5.47094188, -5.43086172, -2.4248497 , -2.38476954, -2.34468938,
       -2.30460922,  0.74148297,  0.78156313,  0.82164329,  3.86773547,
        3.90781563,  3.94789579,  3.98797595,  7.03406814,  7.0741483 ,
        7.11422846])

In [302]: average_values
Out[302]: 
array([-8.63727455, -5.49098196, -2.36472946,  0.78156313,  3.92785571,
        7.0741483 ])
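To see what the ID array looks like, here is a minimal sketch of the same steps on a tiny made-up array. The six input values are invented for illustration and are not from the question; the intermediate name starts_new_group is also just for readability.

```python
import numpy as np

# Small made-up input, already sorted (illustrative values only)
A = np.array([0.1, 0.2, 0.3, 5.0, 5.1, 9.0])
tolerance = 1

# True wherever an element is more than `tolerance` above its predecessor;
# the leading False covers the first element, which has no predecessor
starts_new_group = np.append([False], np.diff(A) > tolerance)

# The running count of group starts gives each element its group ID
ID_array = starts_new_group.cumsum()
print(ID_array)  # [0 0 0 1 1 2]

# Sum per group (weighted bincount) divided by element count per group
sums = np.bincount(ID_array, weights=A)
counts = np.bincount(ID_array)
average_values = sums / counts
print(counts)  # [3 2 1]
print(average_values)
```

Passing the values as weights makes np.bincount add them up per ID, so dividing by the unweighted counts yields the group means without an explicit loop.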
