I have an array of events over time. There are more than a billion rows, and each row is made of:
- user_id
- timestamp
- event_type
- other high-precision (float64) numerical columns
I need to sort this array in several different ways; let's take sorting by timestamp as an example. I've tried several methods, including:
- sort the entire array directly with `arr = arr[arr.argsort()]` => awful performance
- using the same method as above, sort chunks of the array in parallel and then do a final sort => better performance (roughly as sketched below)
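A minimal sketch of that chunk-then-final-sort approach (the key column index `TS_COL`, the chunk count, and the thread-pool details are simplified placeholders, not our exact code):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

TS_COL = 1  # placeholder: index of the timestamp column

def sort_chunk(chunk):
    # argsort on the key column, then reorder the whole chunk by that permutation
    return chunk[chunk[:, TS_COL].argsort(kind="stable")]

def chunked_sort(arr, n_chunks=128):
    # 1) sort each chunk in a worker thread (NumPy sorts typically release the GIL)
    chunks = np.array_split(arr, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        sorted_chunks = list(pool.map(sort_chunk, chunks))
    # 2) concatenate and run one final sort over the whole array;
    #    this last, single-threaded pass dominates the runtime
    merged = np.concatenate(sorted_chunks)
    return merged[merged[:, TS_COL].argsort(kind="stable")]

arr = np.random.rand(1_000_000, 4)        # tiny stand-in for the real array
arr_sorted = chunked_sort(arr, n_chunks=8)
```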
Still, even with the "sort in chunks then merge" method, performance degrades rapidly with the number of items: it takes more than 5 minutes to sort 200M rows on a 128-CPU machine. This is a problem because I need to perform multiple sorts consecutively, and I need to do it "on the fly": persisting the sorted array on disk once and for all is not an option, because items are constantly added and removed, and not necessarily in chronological order.
Is there a way to drastically improve the performance of this sorting? For instance, could a parallel Numba implementation of mergesort be a solution? I'd also be interested in a method that works for sorting on multiple columns (e.g. sort on [user_id, timestamp]).
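To be concrete about the multi-column case, this is the kind of result I mean; `np.lexsort` treats the last key as the primary key (the column layout below is just a toy example):

```python
import numpy as np

USER_COL, TS_COL = 0, 1  # placeholder column indices
arr = np.array([[2.0, 10.0],
                [1.0, 30.0],
                [1.0, 20.0],
                [2.0,  5.0]])

# keys are passed in reverse priority order: secondary key first, primary key last
order = np.lexsort((arr[:, TS_COL], arr[:, USER_COL]))
arr_sorted = arr[order]  # sorted by user_id, then by timestamp within each user
```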
Notes:
- the array is of dtype `np.float64` and needs to stay that way because of the contents of other columns which are not mentioned in the example above
- we thought of using a database with separate tables, one for each particular sort order; the advantage would be that the tables are always perfectly sorted, but the drawback is retrieval speed
- for the above reason, we went with local Parquet files, which are blazingly fast to retrieve, but it means we then need to sort them once loaded
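For context, roughly how the data is loaded today (file and column names are placeholders); reading the Parquet file is fast, and the sort that follows is where the time goes:

```python
import numpy as np
import pyarrow.parquet as pq

# placeholder file/column names
table = pq.read_table("events.parquet", columns=["user_id", "timestamp"])
arr = np.column_stack([table[name].to_numpy() for name in table.column_names])
arr = arr.astype(np.float64)

# loading is fast; this sort is the expensive part
arr = arr[arr[:, 1].argsort()]  # column 1 = timestamp in this placeholder layout
```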