
I have an array of events over time. There are more than a billion rows, and each row is made of:

  • user_id
  • timestamp
  • event_type
  • other high-precision (float64) numerical columns

I need to sort this array in several different ways; take sorting by timestamp as an example. I've tried several methods, including:

  • sorting the entire array directly with `arr = arr[arr.argsort()]` => awful performance
  • using the same method, sorting chunks of the array in parallel and then doing a final sort => better performance
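
A minimal sketch of the chunked approach, assuming a 2-D float64 array sorted by one key column (the chunk count, the thread pool, and the `kind="stable"` choice are illustrative, not tuned):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunked_sort_by_column(arr, col, n_chunks=8):
    # Sort each chunk by the key column in parallel; NumPy's argsort
    # releases the GIL for native dtypes, so threads overlap for real.
    chunks = np.array_split(arr, n_chunks)
    with ThreadPoolExecutor() as pool:
        sorted_chunks = list(pool.map(
            lambda c: c[np.argsort(c[:, col], kind="stable")], chunks))
    merged = np.concatenate(sorted_chunks)
    # Final pass: the array is now a concatenation of sorted runs, which
    # the stable sort (timsort for float64) handles much faster than
    # fully random input.
    return merged[np.argsort(merged[:, col], kind="stable")]
```

The final pass is still a single-threaded bottleneck, which matches the degradation described above.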

Still, even with the "sort in chunks then merge" method, performance degrades rapidly with the number of items: sorting 200M rows takes more than 5 minutes on a 128-CPU machine. This is a problem because I need to perform several sorts consecutively, and I need to do it "on the fly": persisting the sorted array to disk once and for all is not an option, because items are constantly added and removed, and not necessarily in chronological order.

Is there a way to drastically improve the performance of this sorting? For instance, would a parallel mergesort implemented with Numba be a solution? I'd also be interested in a method that works for sorting on multiple columns (e.g. sort on `[user_id, timestamp]`).
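For the multi-column case, `np.lexsort` sorts on several keys in one pass (single-threaded, so it won't solve the speed problem by itself). A small sketch, with hypothetical column indices:

```python
import numpy as np

def sort_by_columns(arr, cols):
    # np.lexsort treats the LAST key as the primary one, so reverse the
    # priority order: cols=(user_col, time_col) sorts by user_id first,
    # then by timestamp within each user.
    keys = tuple(arr[:, c] for c in reversed(cols))
    return arr[np.lexsort(keys)]

# Hypothetical layout: column 0 = user_id, column 1 = timestamp.
# sort_by_columns(arr, (0, 1)) orders rows by [user_id, timestamp].
```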

Notes:

  • the array is of dtype `np.float64` and needs to stay that way, because of the contents of other columns which are not mentioned in the above example.

  • we thought of using a database with separate tables, one per sort order; the advantage is that the tables are always perfectly sorted, but the drawback is retrieval speed

  • for the above reason, we went with local Parquet files, which are blazingly fast to retrieve, but that means we need to sort the data each time it is loaded

Jivan
  • What's the array dtype? – hpaulj Jun 18 '20 at 10:38
  • good point - it's float64 (i added this into the question) – Jivan Jun 18 '20 at 10:40
  • "This is a problem because I need to perform multiple sortings consecutively, and I need to do it "on the fly", i.e. persisting the sorted array on disk once and for all is not an option because items are always added and removed, not necessarily in chronological order." Have you considered using a relational database, e.g. postgres? – juanpa.arrivillaga Jun 18 '20 at 10:42
  • how often do you have to run the sorting? just once? it might be faster using a c++ implementation or even CUDA if you have to do it very often. – Dorian Jun 18 '20 at 10:43
  • @juanpa.arrivillaga we did test postgres, however it was among the worst in terms of performance in retrieving results — it's an interesting idea, though, as we could simply put everything there in separate tables (one for each way of sorting) and they would all be always sorted — the problem is just that it takes ages to retrieve – Jivan Jun 18 '20 at 10:45
  • @Dorian do you have a way to make a multiple-billion-rows array fit into a single GPU? I've tried using CuPy but the array just wouldn't fit into the GPU's memory. Any tips welcome, though. – Jivan Jun 18 '20 at 10:47
  • @juanpa.arrivillaga performance is even worse when using a single table with indexes – Jivan Jun 18 '20 at 10:50
  • I think it would be helpful to include everything in this post you tried so far. For example, I would be curious to see the SQL you used to test postgres. – Paul Brodersen Jun 18 '20 at 10:52
  • @juanpa.arrivillaga using `arr.sort()` sorts the entire array by order of columns, and I don't need that — I need one copy sorted by one column (and only that column), another copy sorted by a different column (and only that column), and so on – Jivan Jun 18 '20 at 10:55
  • have you considered using a structured dtype, then passing the `order` argument on the necessary field in `np.ndarray.sort`? – juanpa.arrivillaga Jun 18 '20 at 11:10
  • @juanpa.arrivillaga I have considered it, yes, but it seems that using structured dtypes comes with numerous other problems in terms of the ability to play with vectorised ufuncs, and also performance (it has the advantage of being less memory-consuming though, as not all fields have to be 64 bits just because one has to) – Jivan Jun 18 '20 at 11:12
  • You can use a view to sort. Like [this](https://stackoverflow.com/questions/2828059/sorting-arrays-in-numpy-by-column). – juanpa.arrivillaga Jun 18 '20 at 11:13
  • @juanpa.arrivillaga I'm trying this — will update you with the result – Jivan Jun 18 '20 at 11:23
  • @juanpa.arrivillaga sorting with views is twice as slow as sorting directly with argsort – Jivan Jun 18 '20 at 11:26
  • @juanpa.arrivillaga https://stackoverflow.com/a/35624868/2091169 – Jivan Jun 18 '20 at 12:10
  • Is it necessary for the application to actually shuffle the data of the entire array? Can you get away with storing the result of `argsort()` and use indirect indexing? You want to use ufuncs, but those don't require a specific array order. You might consider storing the array in column-major order. – Han-Kwang Nienhuys Jun 18 '20 at 21:01
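
The indirect-indexing idea from the last comment can be sketched as follows: store one `argsort()` permutation per sort key (one int64 per row) instead of a fully shuffled copy of the data. Column index and array shape are hypothetical:

```python
import numpy as np

TIMESTAMP_COL = 1  # hypothetical column index

def make_order(arr, col):
    # One cached permutation per sort key; far cheaper to keep around
    # than a sorted copy of a multi-billion-row float64 array.
    return np.argsort(arr[:, col], kind="stable")

rng = np.random.default_rng(0)
arr = np.asfortranarray(rng.random((1_000, 4)))  # column-major, as suggested
order = make_order(arr, TIMESTAMP_COL)

# Read (or slice) in timestamp order without materialising a sorted copy:
first_ten = arr[order[:10]]
```

Column-major storage makes the single-column `argsort` touch contiguous memory, which is the point of the `np.asfortranarray` call.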

0 Answers