
I am trying to sort a large number of arrays in Python. I need to perform the sorting for over 11 million arrays at once.

Also, it would be nice if I could directly get the indices that would sort the array.

As of now I'm using numpy.argsort(), but that's too slow on my machine (it takes over an hour to run).

The same operation in R takes about 15 minutes on the same machine.

Can anyone tell me a faster way to do this in Python?

Thanks

EDIT:

Adding an example

If I have the following dataframe:

agg:

x      y      w      z
1      2      2      5
1      2      6      7
3      4      3      3
5      4      7      8
3      4      2      5
5      9      9      9

I am running the following function and command on it:

def function(group):
    z = group['z'].values
    w = group['w'].values
    func = w[np.argsort(z)[::-1]][:7]  # I need the top 7 in case there are many
    return np.array_str(func)[1:-1]

output = agg.groupby(['x', 'y']).apply(function).reset_index()

so my output dataframe will look like this:

output:

x   y   w
1   2   6,2
3   4   2,3
5   4   7
5   9   9
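For reference, the example above can be reproduced end to end like this (a minimal sketch; the function name is arbitrary, and the posted code's typos are corrected):

```python
import numpy as np
import pandas as pd

# The example dataframe from the question
agg = pd.DataFrame({
    'x': [1, 1, 3, 5, 3, 5],
    'y': [2, 2, 4, 4, 4, 9],
    'w': [2, 6, 3, 7, 2, 9],
    'z': [5, 7, 3, 8, 5, 9],
})

def top7(group):
    z = group['z'].values
    w = group['w'].values
    func = w[np.argsort(z)[::-1]][:7]  # top 7 w values by descending z
    return np.array_str(func)[1:-1]

output = agg.groupby(['x', 'y']).apply(top7).reset_index()
print(output)
```

Note that np.array_str joins values with spaces, so the strings come out as e.g. "6 2" rather than "6,2".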
user324

3 Answers


Well, for cases like these where you are interested in partially sorted indices, there's NumPy's np.argpartition.

You have the troublesome np.argsort in: w[np.argsort(z)[::-1]][:7], which is essentially w[idx], where idx = np.argsort(z)[::-1][:7].

So, idx could be calculated with np.argpartition, like so -

idx = np.argpartition(-z,np.arange(7))[:7]

That -z is needed because by default np.argpartition tries to get sorted indices in ascending order. So, to reverse it, we have negated the elements.
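A quick illustration of that default (smallest-first) behaviour and the negation trick, on a tiny array:

```python
import numpy as np

z = np.array([3, 1, 4, 1, 5])

# Default: np.argpartition partitions by ascending value, so with
# kth=np.arange(2) the first two slots hold indices of the two smallest.
small = np.argpartition(z, np.arange(2))[:2]

# Negating the elements flips the order: indices of the two largest.
large = np.argpartition(-z, np.arange(2))[:2]

print(z[small])  # the two smallest values
print(z[large])  # the two largest values
```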

Thus, the proposed change in the original code would be :

func = w[np.argpartition(-z,np.arange(7))[:7]]

Runtime test -

In [162]: z = np.random.randint(0,10000000,(1100000)) # Random int array

In [163]: idx1 = np.argsort(z)[::-1][:7]
     ...: idx2 = np.argpartition(-z,np.arange(7))[:7]
     ...: 

In [164]: np.allclose(idx1,idx2) # Verify results
Out[164]: True

In [165]: %timeit np.argsort(z)[::-1][:7]
1 loops, best of 3: 264 ms per loop

In [166]: %timeit np.argpartition(-z,np.arange(7))[:7]
10 loops, best of 3: 36.5 ms per loop
Divakar
  • That is a good solution, but in case somewhere in my dataframe the numbers to be sorted are fewer than 7, then I think it won't work. (It is a possibility. The output needs to be at most 7.) – user324 May 04 '16 at 19:51
  • @GunjanDewan So, just replace the `7` here by that number? You can keep that as a variable and let the variable handle it? Something like `n = 5; func = w[np.argpartition(-z,np.arange(n))[:n]]`, where `n` is that variable. – Divakar May 04 '16 at 19:53
  • @GunjanDewan Or are you saying `z` itself could be less than `7` elements? – Divakar May 04 '16 at 20:01
  • Yes, z itself can have fewer than 7 elements. But I've added a variable based on len(z). I'm running it on my dataset currently. I'm hoping it works faster. – user324 May 04 '16 at 20:04
  • @GunjanDewan Yup, that's what I was about to suggest, to use `n = min(len(z),7)` and then `func = w[np.argpartition(-z,np.arange(n))[:n]]`. Would love to see the runtime results from your end too! Keep me posted. – Divakar May 04 '16 at 20:06
  • @GunjanDewan Very glad to hear that! – Divakar May 04 '16 at 21:10
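Following the comments, a sketch of the length-guarded version (the helper name is hypothetical):

```python
import numpy as np

def top_k_by(z, w, k=7):
    """Return the elements of w at the indices of the k largest values of z."""
    # Guard against groups with fewer than k elements, per the comments above
    n = min(len(z), k)
    idx = np.argpartition(-z, np.arange(n))[:n]
    return w[idx]
```

For example, top_k_by(np.array([5, 7]), np.array([2, 6])) returns the w values ordered by descending z even though the group has fewer than 7 elements.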

The reason Python is so much slower than R is that Python does not typecast variables (i.e. int, string, float), so part of each comparison to determine which value is larger is spent determining the variable type.

You can't fix this problem using Python alone, but you can include type definitions using Cython (ctypes and psyco can also perform the same function, but I prefer Cython). A simple example of how this works is at http://docs.cython.org/src/quickstart/cythonize.html

Cython compiles a .c version of your Python file, which can be imported instead of the .py to reduce the runtime. All the possible ways to compile using Cython are shown at http://docs.cython.org/src/reference/compilation.html

dan arters
  • You seem to have ignored or missed the fact that the questioner is using NumPy. NumPy and R need to do similar amounts of type checking to each other; both only need to check the array's element type once when sorting it, rather than once for each comparison. – user2357112 May 04 '16 at 19:10

Your input and output are a bit confusing. Please provide some sample data.

But look into: http://pandas.pydata.org/pandas-docs/stable/api.html#reshaping-sorting-transposing Pandas sorting is as optimized as it gets. Focus on the Series sort, as each column of the DataFrame is more accurately represented as a Series.
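As a sketch of what a pandas-native approach could look like on the question's example (assuming the goal is the top 7 w values per (x, y) group by descending z; the joined string format mirrors the question's output):

```python
import pandas as pd

# The example dataframe from the question
agg = pd.DataFrame({
    'x': [1, 1, 3, 5, 3, 5],
    'y': [2, 2, 4, 4, 4, 9],
    'w': [2, 6, 3, 7, 2, 9],
    'z': [5, 7, 3, 8, 5, 9],
})

# Sort once by z descending, then take the first (up to) 7 w values per group.
output = (agg.sort_values('z', ascending=False)
             .groupby(['x', 'y'])['w']
             .apply(lambda s: ','.join(map(str, s.head(7))))
             .reset_index())
print(output)
```

Because groupby preserves the within-group row order of the pre-sorted frame, s.head(7) already yields the w values in descending-z order, and it naturally handles groups with fewer than 7 rows.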

feynmanium
  • I have edited it further. Please tell me if it's clear now. – user324 May 04 '16 at 19:18
  • Gunjan what are you trying to do here? Can you verbally explain what you want it to do. This example is meaningless and if there isn't an explanation of what you want then the solution to the problem is limited to just your code. – feynmanium May 04 '16 at 19:51