
I have a ranking function that I apply to a large number of columns of several million rows, and it takes minutes to run. By removing all of the logic preparing the data for application of the .rank() method, i.e. by doing this:

ranked = df[['period_id', 'sector_name'] + to_rank].groupby(['period_id', 'sector_name']).transform(lambda x: (x.rank(ascending = True) - 1)*100/len(x))        

I managed to get this down to seconds. However, I need to retain my logic, and am struggling to restructure my code: ultimately, the largest bottleneck is my double use of lambda x:, but clearly other aspects are slowing things down (see below). I have provided a sample data frame, together with my ranking functions below, i.e. an MCVE. Broadly, I think that my questions boil down to:

(i) How can one replace the .apply(lambda x: ...) usage in the code with a fast, vectorized equivalent?

(ii) How can one loop over multi-indexed, grouped data frames and apply a function? In my case, the function should apply to each unique combination of the date_id and category columns.

(iii) What else can I do to speed up my ranking logic? The main overhead seems to be in .value_counts(). This overlaps with (i) above; perhaps one can do most of this logic on df, for example via construction of temporary columns, before sending it for ranking. Similarly, can one rank the sub-dataframe in a single call?

(iv) Why use pd.qcut() rather than df.rank()? The latter is cythonized and seems to have more flexible handling of ties, but I cannot find a comparison between the two, and pd.qcut() seems most widely used. (A small illustration of the tie-handling difference follows below.)
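
As a minimal illustration of that tie-handling difference (illustrative only, not part of the code below; the duplicates argument to pd.qcut is assumed to be available, i.e. pandas 0.20+):

import pandas as pd

s = pd.Series([1, 1, 2, 3])

# Series.rank() resolves ties itself (average rank by default; pct=True gives percentiles)
print(s.rank(pct=True))                                  # 0.375, 0.375, 0.75, 1.0

# pd.qcut() bins by quantile edges, so heavy ties can produce duplicate edges;
# duplicates='drop' collapses them instead of raising a ValueError
print(pd.qcut(s, 4, labels=False, duplicates='drop'))    # 0, 0, 1, 2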

Sample input data is as follows:

import pandas as pd
import numpy as np
import random

to_rank = ['var_1', 'var_2', 'var_3']
df = pd.DataFrame({'var_1' : np.random.randn(1000), 'var_2' : np.random.randn(1000), 'var_3' : np.random.randn(1000)})
df['date_id'] = np.random.choice(range(2001, 2012), df.shape[0])
df['category'] = ','.join(chr(random.randrange(97, 97 + 4 + 1)).upper() for x in range(1,df.shape[0]+1)).split(',')  # random category label, 'A' to 'E', for each row

The two ranking functions are:

def rank_fun(df, to_rank): # calls ranking function f(x) to rank each category at each date
    #extra data tidying logic here beyond scope of question - can remove
    ranked = df[to_rank].apply(lambda x: f(x))
    return ranked


def f(x):
    nans = x[np.isnan(x)] # Remove nans as these will be ranked with 50
    sub_df = x.dropna() # non-nan data to be ranked
    nans_ranked = nans.replace(np.nan, 50) # give nans rank of 50

    if len(sub_df.index) == 0: #check not all nan.  If no non-nan data, then return with rank 50
        return nans_ranked

    if len(sub_df.unique()) == 1: # if all data has same value, return rank 50
        sub_df[:] = 50
        return sub_df

    #Check that we don't have too many clustered values, such that we can't bin due to overlap of ties, and reduce bin size provided we can at least quintile rank.
    max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
    max_bins = len(sub_df) / max_cluster 

    if max_bins > 100: #if largest cluster <1% of available data, then we can percentile_rank
        max_bins = 100

    if max_bins < 5: #if we don't have the resolution to quintile rank then assume no data.
        sub_df[:] = 50
        return sub_df

    bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)

    sub_df_ranked = pd.qcut(sub_df, bins, labels=False) # currently using pd.qcut.  Series.rank() seems to have extra functionality, but overheads are similar in practice
    sub_df_ranked *= (100 / bins) # qcut with labels=False returns labels 0 to bins-1, so multiply by 100/bins to put scores on a 0 - 100 scale.  E.g. with quintiles the labels are 0 - 4, so we multiply by 100 / 5 = 20
    ranked_df = pd.concat([sub_df_ranked, nans_ranked])
    return ranked_df
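
A toy illustration of the bin-resolution logic above (made-up numbers, not part of the real data):

# 50 non-nan values whose largest tie cluster has 8 members
s = pd.Series([5]*8 + list(range(100, 142)))
max_bins = len(s) / s.value_counts().iloc[0]   # 50 / 8 = 6.25 -> 6 bins
print(pd.qcut(s, int(max_bins), labels=False) * (100 / int(max_bins)))   # scores on a 0 - 100 scale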

And the code to call my ranking function and recombine with df is:

# ensure don't get duplicate columns if ranking already executed
ranked_cols = [col + '_ranked' for col in to_rank]

ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank)) 
ranked.columns = ranked_cols        
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)    
df = df.join(ranked[ranked_cols])

I am trying to get this ranking logic as fast as I can by removing both lambda x calls; I can remove the logic in rank_fun so that only f(x)'s logic applies, but I also don't know how to process multi-index dataframes in a vectorized fashion. An additional question concerns the differences between pd.qcut() and df.rank(): it seems that both have different ways of dealing with ties, but the overheads seem similar, despite the fact that .rank() is cythonized; perhaps this is misleading, given that the main overheads are due to my usage of lambda x.
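
For comparison, a sketch of what a fully vectorized call could look like (illustrative only: it relies on GroupBy.rank, which is cythonized, and it does not reproduce the nan-to-50 or bin-resolution logic above):

# sketch only: scale and tie handling differ slightly from f(x)
pct = df.groupby(['date_id', 'category'])[to_rank].rank(pct=True).mul(100)
pct.columns = [col + '_ranked' for col in to_rank]
df = df.join(pct.fillna(50))   # give nans a rank of 50, as in f(x)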

I ran %lprun on f(x), which gave me the following results, although the main overhead is the use of .apply(lambda x: ...) rather than a vectorized approach:

Line # Hits Time Per Hit % Time Line Contents

 2                                           def tst_fun(df, field):
 3         1          685    685.0      0.2      x = df[field]
 4         1        20726  20726.0      5.8      nans = x[np.isnan(x)]
 5         1        28448  28448.0      8.0      sub_df = x.dropna()
 6         1          387    387.0      0.1      nans_ranked = nans.replace(np.nan, 50)
 7         1            5      5.0      0.0      if len(sub_df.index) == 0: 
 8                                                   pass #check not empty.  May be empty due to nans for first 5 years e.g. no revenue/operating margin data pre 1990
 9                                                   return nans_ranked
10                                           
11         1        65559  65559.0     18.4      if len(sub_df.unique()) == 1: 
12                                                   sub_df[:] = 50 #e.g. for subranks where all factors had nan so ranked as 50 e.g. in 1990
13                                                   return sub_df
14                                           
15                                               #Finally, check that we don't have too many clustered values, such that we can't bin, and reduce bin size provided we can at least quintile rank.
16         1        74610  74610.0     20.9      max_cluster = sub_df.value_counts().iloc[0] #value_counts sorts by counts, so first element will contain the max
17                                               # print(counts)
18         1            9      9.0      0.0      max_bins = len(sub_df) / max_cluster #
19                                           
20         1            3      3.0      0.0      if max_bins > 100: 
21         1            0      0.0      0.0          max_bins = 100 #if largest cluster <1% of available data, then we can percentile_rank
22                                           
23                                           
24         1            0      0.0      0.0      if max_bins < 5: 
25                                                   sub_df[:] = 50 #if we don't have the resolution to quintile rank then assume no data.
26                                           
27                                               #     return sub_df
28                                           
29         1            1      1.0      0.0      bins = int(max_bins) # bin using highest resolution that the data supports, subject to constraints above (max 100 bins, min 5 bins)
30                                           
31                                               #should track bin resolution for all data.  To add.
32                                           
33                                               #if get here, then neither nans_ranked, nor sub_df are empty
34                                               # sub_df_ranked = pd.qcut(sub_df, bins, labels=False)
35         1       160530 160530.0     45.0      sub_df_ranked = (sub_df.rank(ascending = True) - 1)*100/len(x)
36                                           
37         1         5777   5777.0      1.6      ranked_df = pd.concat([sub_df_ranked, nans_ranked])
38                                               
39         1            1      1.0      0.0      return ranked_df
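
For reference, a profile like the one above can be produced with the line_profiler IPython extension (a sketch; the statement being profiled is just an example):

%load_ext line_profiler
%lprun -f f rank_fun(df, to_rank)   # profile f line by line while rank_fun runs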
  • Did you think about using multiprocessing in order to run the lambda statement faster? I don't know how well pandas handles multiprocessing/multithreading, but I think you should give it a try. – Alon Alexander May 20 '17 at 17:29
  • Thanks, that is an interesting idea. Still, it must be possible to vectorize my 'loop'! – Carl May 21 '17 at 15:10
  • Numba may be able to vectorize your ranking functions. – Kyle May 23 '17 at 20:36
  • 1
    I haven't spent enough time on this to have an excellent answer, but have you attempted to put your data into columns that can but run over in parallel, and then pass those values through to vectorized functions, like `bn.nanrankdata`? That way, you don't need to call out to python `n` times, you can stay in C code. But it's dependent on being able to have a function that can run over each column atomically. Can you do that? – Maximilian May 24 '17 at 20:03
  • I don't know exactly, but if it works like `map`, maybe it will run faster without enclosing the function in a lambda: `ranked = df[to_rank].apply(f)` – Netwave May 25 '17 at 08:01

2 Answers


I'd build a function using numpy that I plan to apply within each group defined by a pandas groupby.

def rnk(df):
    # argsort each column independently: a[:, j] is the row order that sorts column j
    a = df.values.argsort(0)
    n, m = a.shape
    # scatter the positions 0..n-1 back to the original rows, per column,
    # which gives zero-based ranks
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    # divide by the group size to get percentiles in [0, 1)
    return pd.DataFrame(b / n, df.index, df.columns)

gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']
df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')

   var_1_ranked  var_2_ranked  var_3_ranked
0      0.333333      0.809524      0.428571
1      0.160000      0.360000      0.240000
2      0.153846      0.384615      0.461538
3      0.000000      0.315789      0.105263
4      0.560000      0.200000      0.160000
...

How It Works

  • Because I know that ranking is related to sorting, I want to use some clever sorting to do this quicker.
  • numpy's argsort will produce a permutation that can be used to slice the array into a sorted array.

    a = np.array([25, 300, 7])
    b = a.argsort()
    print(b)
    
    [2 0 1]
    
    print(a[b])
    
    [  7  25 300]
    
  • So, instead, I'm going to use the argsort to tell me where the first, second, and third ranked elements are.

    # create an empty array that is the same size as b or a
    # but these will be ranks, so I want them to be integers
    # so I use empty_like(b) because b is the result of 
    # argsort and is already integers.
    u = np.empty_like(b)
    
    # now just like when I sliced a above with a[b]
    # I slice u the same way but instead I assign to
    # those positions, the ranks I want.
    # In this case, I defined the ranks as np.arange(b.size) + 1
    u[b] = np.arange(b.size) + 1
    
    print(u)
    
    [2 3 1]
    
  • And that was exactly correct. The 7 was in the last position but was our first rank. 300 was in the second position and was our third rank. 25 was in the first position and was our second rank.

  • Finally, I divide the ranks by the number of elements to get the percentiles. It so happens that because I used zero-based ranking np.arange(n), as opposed to one-based np.arange(1, n+1) or np.arange(n) + 1 as in our example, I can do the simple division to get the percentiles.
  • What's left to do is apply this logic to each group. We can do this in pandas with groupby.
  • Some of the missing details include how I use argsort(0) to get independent sorts per column, and that I do some fancy slicing to rearrange each column independently. A small standalone demo of those two details follows below.
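
A small standalone demo of those two details (values chosen arbitrarily for illustration):

import numpy as np

a = np.array([[3., 1.],
              [1., 2.],
              [2., 0.]])

order = a.argsort(0)          # per-column permutations that sort each column
print(order)
# [[1 2]
#  [2 0]
#  [0 1]]

# scatter 0..n-1 back into each column independently -> zero-based ranks
n, m = a.shape
ranks = np.empty_like(order)
ranks[order, np.arange(m)[None, :]] = np.arange(n)[:, None]
print(ranks)
# [[2 1]
#  [0 2]
#  [1 0]]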

Can we avoid the groupby and have numpy do the whole thing?
I'll also take advantage of numba's just-in-time compilation to speed things up with njit

from numba import njit

@njit
def count_factor(f):
    # integer zeros, one slot per group label plus a leading zero
    c = np.arange(f.max() + 2) * 0
    for i in f:
        c[i + 1] += 1
    return c

@njit
def factor_fun(f):
    # per-row group size and cumulative offset (the start rank of each group)
    c = count_factor(f)
    cc = c[:-1].cumsum()
    return c[1:][f], cc[f]

def lexsort(a, f):
    # shift each group's values into a disjoint band so that one argsort
    # sorts by group first, then by value within the group
    f = f * (a.max() - a.min() + 1)
    return (f.reshape(-1, 1) + a).argsort(0)


def rnk_numba(df, gcols, rcols):
    # factorize the grouping columns into integer labels 0..n_groups-1
    tups = list(zip(*[df[c].values.tolist() for c in gcols]))
    f = pd.Series(tups).factorize()[0]
    # sort by group, then by value within each group, for every ranking column
    a = lexsort(np.column_stack([df[c].values for c in rcols]), f)
    c, cc = factor_fun(f)
    c = c[:, None]
    cc = cc[:, None]
    n, m = a.shape
    # scatter the global sort positions back to the original rows
    b = np.empty_like(a)
    b[a, np.arange(m)[None, :]] = np.arange(n)[:, None]
    # subtract each group's start rank and divide by its size -> within-group percentiles
    return pd.DataFrame((b - cc) / c, df.index, rcols).add_suffix('_ranked')

How it works

  • Honestly, this is difficult to process mentally. I'll stick with expanding on what I explained above.
  • I want to use argsort again to drop rankings into the correct positions. However, I have to contend with the grouping columns. So what I do is compile a list of tuples and factorize them as was addressed in this question here
  • Now that I have a factorized set of tuples I can perform a modified lexsort that sorts within my factorized tuple groups. This question addresses the lexsort.
  • A tricky bit remains to be addressed where I must offset the newfound ranks by the size of each group so that I get fresh ranks for every group. This is taken care of in the tiny snippet b - cc in the code above. But calculating cc is a necessary component.

So that's some of the high level philosophy. What about @njit?

  • Note that when I factorize, I am mapping to the integers 0 to n - 1 where n is the number of unique grouping tuples. I can use an array of length n as a convenient way to track the counts.
  • In order to accomplish the groupby offset, I needed to track the counts and cumulative counts in the positions of those groups as they are represented in the list of tuples or the factorized version of those tuples. I decided to do a linear scan through the factorized array f and count the observations in a numba loop. While doing that scan, I also produce the cumulative counts needed for the group offsets (see the small sketch after this list).
  • numba provides an interface to produce highly efficient compiled functions. It is finicky and you have to acquire some experience to know what is possible and what isn't. I decided to numbafy the two functions that are preceded with the numba decorator @njit. This code works just as well without those decorators, but is sped up with them.
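
A plain-numpy sketch of what factor_fun produces, on tiny made-up labels (no numba, just to show the group sizes and offsets):

import numpy as np

f = np.array([0, 1, 0, 1, 0])                  # factorized group labels
c = np.bincount(f)                             # group sizes        -> [3 2]
cc = np.concatenate(([0], c[:-1].cumsum()))    # group start ranks  -> [0 3]

print(c[f])    # per-row group size   -> [3 2 3 2 3]
print(cc[f])   # per-row group offset -> [0 3 0 3 0]

Within the lexsorted order, group k occupies global ranks cc[k] through cc[k] + c[k] - 1, which is why (b - cc) / c rescales the global ranks to within-group percentiles.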

Timing

%%timeit 
ranked_cols = [col + '_ranked' for col in to_rank]

ranked = df[['date_id', 'category'] + to_rank].groupby(['date_id', 'category'], as_index = False).apply(lambda x: rank_fun(x, to_rank))
ranked.columns = ranked_cols        
ranked.reset_index(inplace = True)
ranked.set_index('level_1', inplace = True)  
1 loop, best of 3: 481 ms per loop

gcols = ['date_id', 'category']
rcols = ['var_1', 'var_2', 'var_3']

%timeit df.groupby(gcols)[rcols].apply(rnk).add_suffix('_ranked')
100 loops, best of 3: 16.4 ms per loop

%timeit rnk_numba(df, gcols, rcols).head()
1000 loops, best of 3: 1.03 ms per loop
  • Thanks for this, ~26x as fast! But it gives different ranks from my code, and also doesn't cater for nans? Also, I note that you use .apply(rnk) without a lambda x: is this a shortcut? It would be really appreciated if you explained exactly how each line of rnk( works and what your code is doing, as my numpy is quite weak. – Carl May 26 '17 at 13:37
  • Note to others: the ranks are the same (nan question aside), but my output ranks are sorted differently as you can see if you compare the index. e.g. my_ranks.var_1_ranked / new_ranks.var_1_ranked = 100 – Carl May 26 '17 at 13:54
  • I'll add more commentary when I get back to a computer. Also, how do you want nan handled? No rank or nan? Worst rank? – piRSquared May 26 '17 at 13:56
  • Thanks. Also, would it be even faster to use transform( rather than apply(? I gather that that is optimized so that some overheads are removed. – Carl May 26 '17 at 14:14
  • @Carl oh I've got something even better – piRSquared May 26 '17 at 14:15
  • Cool :). On nan, same logic as I use in my example above: all ranks from 0 - 99, with nans having values of 50. I'm looking to improve on this in fact: currently I reduce resolution to the maximum available given value ties: e.g. if I have 10% of the ex-nan data-set having the same value, I reduce the ranking resolution to decile from percentile. I'm also investigating using bottleneck's 'nanrank' and using average rank where there are ties, and I suspect that is also even faster than pd.qcut. – Carl May 26 '17 at 14:21
  • I hope you have a chance; I've spent most of today working on the solution suggested below, but it is only 3x faster than my current, inefficient solution. If you have any ideas on how to reflect the nan treatment using super fast numpy/something clever, that would be great! – Carl May 26 '17 at 21:06
  • I obviously can't spend all my time on it. But when I can, I'm giving it a go. I do have ideas. – piRSquared May 26 '17 at 21:07
  • Thanks for the latest edition, that is lightning fast! Almost 500x the speed of my original function. But I don't understand the code. The ranking also seems to have changed, or does so when I give it a dataframe containing some np.nans. And I need to percentile rank only the non-nans and then replace those nans with 50s – Carl May 27 '17 at 11:39
  • I wonder whether the nans are ranking incorrectly as per https://github.com/pandas-dev/pandas/issues/12694: it seems that s.argsort() is not exactly the same as np.argsort() ? The example in the link gives In [75]: s.values Out[75]: array([ 200., 100., 400., 500., nan, 300.]) In [76]: np.argsort(s.values) Out[76]: array([1, 0, 5, 2, 3, 4], dtype=int64) In [77]: s.argsort().values Out[77]: array([ 1, 0, 4, 2, -1, 3], dtype=int64) – Carl May 28 '17 at 22:19
  • 1
    I've added far more color to what I've done so far. I'll get to `nan`s a bit later. I've included a couple of links to other questions that have been asked in the effort to provide a better answer for you. Please feel free to upvote answers to those questions as well as those people are unwittingly helping you :-) – piRSquared May 30 '17 at 19:52
  • Thanks :). Yes, I generally do up-vote helpful questions, and helpful answers ;) – Carl May 30 '17 at 20:29

I suggest you try this code. It's 3 times faster than yours, and clearer.

rank function:

def rank(x):
    counts = x.value_counts()
    # the largest tie cluster limits how finely we can bin
    bins = int(0 if len(counts) == 0 else x.count() / counts.iloc[0])
    bins = 100 if bins > 100 else bins
    if bins < 5:
        # not enough resolution to quintile rank: treat as no data
        return x.apply(lambda x: 50)
    else:
        # bin labels 0..bins-1, rescaled to a 0-100 range; nans get 50
        return (pd.qcut(x, bins, labels=False) * (100 / bins)).fillna(50).astype(int)

single-process apply:

for col in to_rank:
    df[col + '_ranked'] = df.groupby(['date_id', 'category'])[col].apply(rank)

multiprocessing apply:

import sys
from multiprocessing import Pool

def tfunc(col):
    return df.groupby(['date_id', 'category'])[col].apply(rank)

pool = Pool(len(to_rank))  # one worker process per column
# note: sys.maxint exists only on Python 2; see the Python 3 sketch below
result = pool.map_async(tfunc, to_rank).get(sys.maxint)
for (col, val) in zip(to_rank, result):
    df[col + '_ranked'] = val
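
On Python 3, sys.maxint no longer exists; a sketch of an equivalent call that also closes the worker processes automatically (assuming the same tfunc and rank as above):

from multiprocessing import Pool

if __name__ == '__main__':
    with Pool(len(to_rank)) as pool:      # the context manager terminates the workers on exit
        result = pool.map_async(tfunc, to_rank).get()
    for col, val in zip(to_rank, result):
        df[col + '_ranked'] = val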
  • Thanks, but I get an IndexError, e.g. IndexError: index 30 is out of bounds for axis 0 with size 6, with pd.qcut(x, bins, labels=False) – Carl May 26 '17 at 14:25
  • bins needs to be an integer for qcut, so fixing this error was as simple as changing the last line of the function to return (pd.qcut(x, int(bins), labels=False) – Carl May 26 '17 at 15:21
  • Interestingly using transform is slower than your loop! df.groupby(['date_id', 'category'])[to_rank].transform(rank) 1 loop, best of 3: 214 ms per loop. Any idea why? this seems counter-intuitive to me! – Carl May 26 '17 at 15:24
  • as I understand it (not well), the whole point of .transform( is that you can pass multiple columns/Series to your function in one go, and then each column is looped over independently, whereas with .apply(, the columns are passed individually which apparently has extra overheads. In your code, by looping over to_rank, you are doing a new groupby call and Series selection [col] for each column in to_rank, and yet despite this, it is still faster than transform! – Carl May 26 '17 at 15:42
  • Following your comment, I corrected some bugs in the code and added multiprocessing sample code; it will be a bit faster. – xmduhan May 27 '17 at 00:53
  • In pandas, applying to single columns individually is usually much faster than applying to multiple columns together. – xmduhan May 27 '17 at 01:00
  • Thanks. Yes, sure, but I thought that .transform( was specifically designed to overcome that. – Carl May 27 '17 at 11:01
  • I have tried this, but I get the error "AttributeError: module 'sys' has no attribute 'maxint'". I gather that this has been deprecated (https://docs.python.org/3.1/whatsnew/3.0.html#integers). Using 'sys.maxsize' gives "OverflowError: timestamp too large to convert to C _PyTime_t". I don't understand parallel processing yet, if you have a simple explanation for your code, that would be appreciated! – Carl May 28 '17 at 19:41
  • This is caused by the different Python versions: I use Python 2 and you use Python 3. I think you just need to replace sys.maxsize with a big int that is much larger than your original execution time in seconds, for example 65535. – xmduhan May 29 '17 at 00:22
  • For parallel processing you should read the Python documentation: https://docs.python.org/3/library/multiprocessing.html; no need to read the whole thing, understanding the first code sample is enough. I use map_async instead of map just to make it easy to cancel the running process; see this for more information: https://stackoverflow.com/questions/1408356/keyboard-interrupts-with-pythons-multiprocessing-pool – xmduhan May 29 '17 at 00:29
  • I don't understand why you use map_async rather than apply_async, and, most importantly, how to close the processes after execution? – Carl Jun 02 '17 at 17:05
  • Because I didn't know there is a function called apply_async :-), and map_async is what I need. The processes will auto-close after execution. – xmduhan Jun 03 '17 at 00:28
  • Ah thanks. Yes, everything you have posted has been a great help :). But I seem to have multiple spawned processes that don't auto-close ... I call the parallel processing code multiple times as I have several different 'levels' of ranking that rely on previous ranks. – Carl Jun 03 '17 at 13:22
  • Also, doing result = pool.map_async(tfunc, to_rank).get(), i.e. without the large integer (rather than result = pool.map_async(tfunc, to_rank).get(65535)), seemed a lot faster! – Carl Jun 03 '17 at 22:14
  • I can't quite understand the parallel processing problem. If you provide sample code, I will try to do something further. – xmduhan Jun 05 '17 at 01:52