
I have a list of x, y coordinates that I need to sort by x coordinate, then by y coordinate when x is equal, while eliminating duplicate coordinates. For example, if the list is:

[[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], 
 [349.9, 486.6], [450.0, 313.3]]

I would need to rearrange it to:

[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6],
 [500.0, 400.0]]

(with one duplicate of [450.0, 313.3] removed)

user3483203

5 Answers


That is the normal sort order for a list of lists anyway. De-dupe it with a dict: converting each pair to a tuple makes it hashable, so duplicate coordinates collapse onto one key.

>>> L = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
>>> sorted({tuple(x): x for x in L}.values())
[[300.0, 400.0],
 [349.9, 486.6],
 [350.0, 313.3],
 [450.0, 313.3],
 [450.0, 486.6],
 [500.0, 400.0]]
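The `tuple` conversion is what makes this work: lists are not hashable, so they cannot be dict keys directly. A minimal sketch:

```python
L = [[450.0, 486.6], [450.0, 313.3], [450.0, 313.3]]

# Lists can't be hashed, so using them as dict keys fails:
try:
    {x: x for x in L}
except TypeError as e:
    print("TypeError:", e)  # unhashable type: 'list'

# Tuples are hashable, so both copies of the duplicate map to one key:
deduped = sorted({tuple(x): x for x in L}.values())
print(deduped)  # [[450.0, 313.3], [450.0, 486.6]]
```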
wim

As we are sorting anyway, we can dedupe with `groupby`:

>>> import itertools
>>> data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
>>> [k for k, g in itertools.groupby(sorted(data))]
[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3], [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
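The sort is what makes `groupby` a valid dedupe here: it only merges *adjacent* equal items, so sorting first is required to bring duplicates together. A small sketch:

```python
from itertools import groupby

data = [[1.0, 2.0], [0.0, 0.0], [1.0, 2.0]]

# On unsorted input the non-adjacent duplicate survives:
unsorted_result = [k for k, g in groupby(data)]
print(unsorted_result)  # [[1.0, 2.0], [0.0, 0.0], [1.0, 2.0]]

# Sorting first makes equal rows adjacent, so they collapse:
sorted_result = [k for k, g in groupby(sorted(data))]
print(sorted_result)    # [[0.0, 0.0], [1.0, 2.0]]
```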

A few timings:

>>> from itertools import groupby
>>> from timeit import timeit
>>> from toolz import unique
>>> import numpy as np  # just to create a large example
>>> a = np.random.randint(0, 215, (10000, 2)).tolist()
>>> len([k for k, g in groupby(sorted(a))])
8977  # ~10% duplicates
>>> 
>>> timeit("[k for k, g in groupby(sorted(a))]", globals=globals(), number=1000)
6.1627248489967315
>>> timeit("sorted({tuple(x): x for x in a}.values())", globals=globals(), number=1000)
6.654527607999626
>>> timeit("sorted(unique(a, key=tuple))", globals=globals(), number=1000)
7.198703720991034
>>> timeit("np.unique(a, axis=0).tolist()", globals=globals(), number=1000)
8.848866895001265
Paul Panzer
  • Can you comment on the memory implications? I suspect [based on this](https://stackoverflow.com/questions/4154571/sorted-using-generator-expressions-rather-than-lists/4155652#4155652) that all but the `unique` will be making copies of the data. Of course, I may be wrong :). – jpp Jun 21 '18 at 03:38
  • `unique` needs to keep track of what's been seen already, meaning near the end of it being exhausted there will be the input list, `unique`'s seen set and `sorted`'s input list (`sorted` has to collect the entire input before it can start sorting) simultaneously in memory. So there is no memory advantage for `unique`. – Paul Panzer Jun 21 '18 at 04:58

What you want seems to be easily done with numpy's unique function:

import numpy as np
u = np.unique(data, axis=0) # or np.unique(data, axis=0).tolist()

If you are worried that the result is not sorted by x, then by y, run `np.lexsort()` on the output as well:

u = u[np.lexsort((u[:,1], u[:,0]))]
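Applied to the question's sample data, the `np.unique` approach can be sketched as a self-contained example:

```python
import numpy as np

data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3], [350.0, 313.3],
        [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]

# np.unique with axis=0 drops duplicate rows and returns the survivors
# in lexicographic row order (first column, then second):
u = np.unique(np.array(data), axis=0)
print(u.tolist())
# [[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
#  [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]
```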

Timings (non-random sample):

In [1]: import numpy as np

In [2]: from toolz import unique

In [3]: data = [[450.0, 486.6], [500.0, 400.0], [450.0, 313.3],
   ...:  [350.0, 313.3], [300.0, 400.0], [349.9, 486.6], [450.0, 313.3]]
   ...:  

In [4]: L = 100000 * data

In [5]: npL = np.array(L)

In [6]: %timeit sorted(unique(L, key=tuple))
125 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [7]: %timeit sorted({tuple(x): x for x in L}.values())
139 ms ± 3.41 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [8]: %timeit np.unique(L, axis=0)
732 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [9]: %timeit np.unique(npL, axis=0)
584 ms ± 8.11 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @user3483203 solution:

In [57]: %timeit lex(np.asarray(L))
227 ms ± 8.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [58]: %timeit lex(npL)
76.2 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timings (more random sample):

When sample data are more random, the results are different:

In [29]: npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))

In [30]: L = npL.tolist()

In [31]: %timeit sorted(unique(L, key=tuple))
143 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [32]: %timeit sorted({tuple(x): x for x in L}.values())
134 ms ± 1.14 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [33]: %timeit np.unique(L, axis=0)
78.5 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [34]: %timeit np.unique(npL, axis=0)
54 ms ± 398 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @Paul Panzer's solution:

In [36]: import itertools

In [37]: %timeit [k for k, g in itertools.groupby(sorted(L))]
123 ms ± 3.42 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# @user3483203 solution:

In [54]: %timeit lex(np.asarray(L))
60.1 ms ± 744 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [55]: %timeit lex(npL)
38.8 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
AGN Gazer

We can do this quite fast using `np.lexsort` and some masking:

def lex(arr):
    # Sort rows so duplicates become adjacent (np.lexsort treats the
    # *last* key as primary, so this sorts by y, then x).
    tmp = arr[np.lexsort(arr.T), :]
    # Keep the first row plus every row that differs from the one before it.
    tmp = tmp[np.append([True], np.any(np.diff(tmp, axis=0), 1))]
    # Re-sort by x, then y, for the requested output order.
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

L = np.array(L)
lex(L)

# Output:
[[300.  400. ]
 [349.9 486.6]
 [350.  313.3]
 [450.  313.3]
 [450.  486.6]
 [500.  400. ]]

Performance

Functions

def chrisz(arr):                 
    tmp =  arr[np.lexsort(arr.T),:]
    tmp = tmp[np.append([True],np.any(np.diff(tmp,axis=0),1))]
    return tmp[np.lexsort((tmp[:, 1], tmp[:, 0]), axis=0)]

def pp(data):
    return [k for k, g in itertools.groupby(sorted(data))]

def gazer(data):
    return np.unique(data, axis=0)

def wim(L):
    return sorted({tuple(x): x for x in L}.values())

def jpp(L):
    return sorted(unique(L, key=tuple))

Setup

import itertools

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from timeit import timeit
from toolz import unique

res = pd.DataFrame(
       index=['chrisz', 'pp', 'gazer', 'wim', 'jpp'],
       columns=[10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        npL = np.random.randint(1,1000,(c,2)) + np.random.choice(np.random.random(1000), (c, 2))
        L = npL.tolist()
        stmt = '{}(npL)'.format(f) if f in {'chrisz', 'gazer'} else '{}(L)'.format(f)
        setp = 'from __main__ import L, npL, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=50)

ax = res.div(res.min()).T.plot(loglog=True) 
ax.set_xlabel("N"); 
ax.set_ylabel("time (relative)");

plt.show()

(log-log benchmark plot: relative time vs. N for each function)

Validation

npL = np.random.randint(1,1000,(100000,2)) + np.random.choice(np.random.random(1000), (100000, 2))    
L = npL.tolist()    
chrisz(npL).tolist() == pp(L) == gazer(npL).tolist() == wim(L) == jpp(L)
True
user3483203

Here's one way using sorted and toolz.unique:

from toolz import unique

res = sorted(unique(L, key=tuple))

print(res)

[[300.0, 400.0], [349.9, 486.6], [350.0, 313.3],
 [450.0, 313.3], [450.0, 486.6], [500.0, 400.0]]

Note `toolz.unique` is equivalent to the `unique_everseen` recipe from the itertools documentation (it is documented there, not shipped in the standard library itself). Tuple conversion is necessary because the algorithm checks uniqueness by hashing into a `set`, and lists are not hashable.
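A minimal version of that recipe (simplified, without the full recipe's fast path for `key=None`) looks like this:

```python
def unique_everseen(iterable, key=None):
    """Yield items lazily, skipping any whose key has been seen before."""
    seen = set()
    for item in iterable:
        k = item if key is None else key(item)
        if k not in seen:
            seen.add(k)
            yield item

L = [[450.0, 486.6], [450.0, 313.3], [450.0, 313.3]]
result = sorted(unique_everseen(L, key=tuple))
print(result)  # [[450.0, 313.3], [450.0, 486.6]]
```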

Performance using set appears slightly better than dict here, but as always you should test with your data.

L = L*100000

%timeit sorted(unique(L, key=tuple))               # 223 ms
%timeit sorted({tuple(x): x for x in L}.values())  # 243 ms

I suspect this is because unique is lazy, and so you have less memory overhead since sorted isn't making a copy of the input data.

jpp
  • Sorry, but that's just plain incorrect. `sorted` will consume the entire sequence before doing any comparisons; it requires an intermediary structure in memory all the same, so it's irrelevant here whether `unique` iterates lazily or not. – wim Jun 21 '18 at 03:10
  • @wim, Is that intermediary structure a `list`, or a `dict`? There's a cost attached to creating Python structures, e.g. try `sorted(range(10000))` vs `sorted(list(range(10000)))`, I know which I'd prefer. – jpp Jun 21 '18 at 03:12
  • @wim, The memory benefit is from the fact that if you feed a list (or other in-memory iterable), `sorted` will [make its own list](https://stackoverflow.com/questions/4154571/sorted-using-generator-expressions-rather-than-lists/4155652#4155652) prior to sorting, i.e. if you feed a list it'll copy it. This copy is what I call an intermediary structure. – jpp Jun 21 '18 at 03:25
  • In my case, it's a `dict_values` view. `unique` also necessarily needs to maintain some intermediary structures somehow (I assume it uses a set). If you didn't notice it, that's because your testing was flawed: setting `L = L*100000` means the data contains virtually 100% dupes. – wim Jun 21 '18 at 05:17