1

I am trying to workout the best solution for getting the top N elements with biggest number from a huge list with size of a few billions. So far, I have got the idea of:

get the first N elements, sort them in descending order (list A). 
for N+1 to last element:
    min = the Nth element. 
    if the N+1 element > min then insert it into list A and sort it. 
        remove the last element

Practically, seems like it doesn't consume too much memory, and faster than just using list.sort of the entire huge list follow by getting top N elements

However, this sorting doesn't use the full capacity of the CPU with multi-cores. Is there any built-in function or any other approaches that would do the job with multi-processes? or able to fully utilizes the computing capabilities which would result much faster?

  • Python by itself is single threaded. You would have to use multiprocessing libraries to fully use multiple cores. If you provide a code sample, there may be more efficient ways to handle your problem. – goalie1998 Jan 22 '21 at 05:57
  • ThreadPoolExecutor could be one possible solution. Although I would first just run this with PyPy if possible to see the difference in execution time. – Teejay Bruno Jan 22 '21 at 05:58
  • 1
    You could also check out [`np.partition`](https://numpy.org/doc/stable/reference/generated/numpy.partition.html). Similar idea [here](https://stackoverflow.com/a/23734295/10166393). But is `list.sort` really not fast enough? – ssp Jan 22 '21 at 05:59
  • Feels like heapifying might be a better solution if N is small relative to the length of the list. – Samwise Jan 22 '21 at 06:12
  • I haven't used NumPy package but running the under pypy3 win 32. – user4603876 Jan 22 '21 at 21:14
  • I am having developed a multiprocessing strategy, but not sure if I am doing smart enough. A number of threads acting as a consumer, taking the results then sort it and keep it with top N length and assign "min" as the Nth element. for any N+th smaller than min will be excluded, otherwise, update the list and min Once all results produced, take the top N elements of all those ordered result lists from threads. – user4603876 Jan 22 '21 at 21:25

1 Answers1

0

If you are looking to use parallelize the work, you could use a python library such as Ray.

Using Ray, you could parallelize your search by partitioning the data into multiple sets and having each thread attempt to find the largest N numbers of each subset. Afterwards, you should have k lists of N 'large' numbers. From there, you can find the largest N numbers.

If you would like to learn more about Ray documentation, you can check out the documentation.

Documentation: https://docs.ray.io/en/latest/

tcglezen
  • 456
  • 1
  • 4
  • 12