0

I have a vector of a few hundred million elements that are np.uint8. They only range in value from 0 to 255.

I need to sort this list, and I think it should be more quickly handled than quicksort. I think I could find indices for all value "0" and put them at the front, then for all value "1" and put them after the last insert, and progress until I am done. It would be the mutant offspring of unique and sort with some indexing and should work very quickly.

Is there a built-in that does this well, properly, and in something fast like "C" without me having to homebrew it? Could you point me to it?

Let's say that as a "toy" of my actual problem that I wanted to sort the intensity values for each color (rgb) of a 100 megapixel version of the mandrill image where each color had been converted into a single, very long, vector of uint8 values. If I were to time the difference between sorting methods, which would be reasonable to compute, in python?

EngrStudent
  • 1,924
  • 31
  • 46
  • 5
    You are correct, if you have only a small number of values (e.g. 256) you can just use the "bucket sort" you described (where the number of bins is the number of values). It will work in `O(n)` time (with additional space implications) – Itay May 15 '19 at 16:30
  • 2
    Don't reinvent wheel. Most languages now implement merge sort which is faster than anything you can homebrew. What you describe here is radix sort, which simply increases index at when it hits specific number. Then you can re-create array ordered. – rAndom69 May 15 '19 at 16:30
  • 1
    If you search on the phrase "Python sorting tutorial", you’ll find resources that can explain it much better than we can in an answer here. – Prune May 15 '19 at 16:33
  • 1
    Is that an array or list? If array, is it 1D? – Divakar May 15 '19 at 16:37
  • 3
    What are you going to do with the sorted list? There might be something even simpler than a sorted copy of the data, depending on what you are going to do with the result. E.g. take a look at `numpy.bincount`. – Warren Weckesser May 15 '19 at 16:38
  • 1
    @Prune - I looked around a bit before coming here. It looks, initially, like numpy doesn't have radixsort or bucket sort mentioned in the above comments. Numpy has "quick", "merge" and "heap". – EngrStudent May 15 '19 at 16:38
  • @WarrenWeckesser - thank you very much for that pointer. I can contrive my problem in cdf domain and have a list that is 256 instead of 128M elements long. Very very nice. It is a literal 1M speedup! That is why I ask stuff like that here. No standard sort in the world would give what I just got in 5 minutes from asking someone with a lot more clue than me. – EngrStudent May 15 '19 at 16:40
  • Ah; it sounds like you're trying to "bin" these things, rather than a whole-hog *sort*. Yes, @WarrenWeckesser's pointer is likely what you need. – Prune May 15 '19 at 16:42
  • 1
    @WarrenWeckesser Would you like to post that as an answer? Think it's worth it. – Divakar May 15 '19 at 16:58
  • 1
    @Divakar: done. – Warren Weckesser May 15 '19 at 17:43

3 Answers3

3

You might find that numpy.bincount is all you need. It counts the number of occurrences of each integer in the data.

For example, here's some random unsigned 8 bit integers:

In [100]: np.random.seed(8675309)                                                                                                                                                                      

In [101]: r = np.random.gamma(9, scale=8.5, size=100000).astype(np.uint8)                                                                                                                              

In [102]: r.min(), r.max()                                                                                                                                                                             
Out[102]: (11, 242)

Count the integers using bincount:

In [103]: b = np.bincount(r, minlength=256)                                                                                                                                                            

In [104]: b                                                                                                                                                                                            
Out[104]: 
array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          1,    1,    3,    1,    5,   13,    9,   17,   24,   31,   27,
         41,   55,   63,   96,  131,  146,  178,  210,  204,  268,  297,
        308,  367,  422,  480,  512,  584,  635,  669,  671,  759,  830,
        885,  934,  955, 1025, 1105, 1146, 1145, 1344, 1271, 1353, 1300,
       1456, 1419, 1451, 1504, 1499, 1561, 1600, 1509, 1678, 1621, 1643,
       1633, 1616, 1574, 1677, 1664, 1682, 1625, 1608, 1581, 1598, 1575,
       1583, 1524, 1493, 1381, 1448, 1399, 1422, 1249, 1322, 1225, 1278,
       1174, 1246, 1128, 1161, 1077,  999, 1033,  980,  981,  897,  917,
        880,  813,  779,  774,  697,  716,  651,  612,  657,  592,  556,
        497,  482,  474,  484,  445,  411,  399,  354,  368,  363,  342,
        313,  301,  293,  263,  241,  249,  244,  196,  215,  182,  189,
        172,  161,  139,  143,  142,  120,  121,  104,  103,  112,   88,
         82,   88,   67,   60,   83,   57,   63,   59,   50,   52,   55,
         40,   34,   34,   43,   35,   33,   28,   24,   26,   20,   18,
         21,   26,   30,   17,   15,   12,   17,   11,    7,    6,   16,
          8,    3,    4,   12,    9,    6,    5,    8,   10,    7,    1,
          4,    8,    5,    3,    2,    1,    0,    1,    1,    0,    1,
          0,    0,    4,    2,    2,    0,    0,    2,    0,    1,    1,
          1,    4,    0,    0,    0,    0,    0,    0,    0,    0,    2,
          1,    0,    0,    0,    0,    0,    1,    0,    0,    0,    0,
          0,    0,    0,    1,    0,    0,    0,    0,    0,    0,    0,
          1,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0])

So 0 occurs 0 times in r, 13 occurs 3 times, etc:

In [105]: b[0], b[13]                                                                                                                                                                                  
Out[105]: (0, 3)
Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214
1

You can do this without sorting using bincount and repeat:

In [11]: a = np.bincount(np.array([71, 9, 1, 2, 3, 4, 4, 4, 8, 9, 9, 71], dtype=np.uint8), minlength=256)

In [12]: np.repeat(np.arange(256, dtype=np.uint8), a)
Out[12]: array([ 1,  2,  3,  4,  4,  4,  8,  9,  9,  9, 71, 71], dtype=uint8)
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
0

If you only need the sorted values, then bincount is indeed the way to go. If, however, what you require is more argsort-like, for example, you need to co-sort some other same length array, then you can use this Q&A It compares various methods some of which are much faster than the 'naive' argsort.

Paul Panzer
  • 51,835
  • 3
  • 54
  • 99