
I have a dataframe that contains float values. I need to produce a new dataframe containing the rank of each of these values. Example below:

import pandas as pd
import numpy as np
import numba as nb
@nb.njit('int32[:,:](float64[:,:])', parallel=True)
def fastRanks(df):
    n, m = df.shape
    res = np.empty((n, m), dtype=np.int32)

    for col in nb.prange(m):
        dfCol = -df[:, col]
        order = np.argsort(dfCol)

        # Compute the ranks with the min method
        if n > 0:
            prevVal = dfCol[order[0]]
            prevRank = 1
            res[order[0], col] = 1

            for row in range(1, n):
                curVal = dfCol[order[row]]
                if curVal == prevVal:
                    res[order[row], col] = prevRank
                else:
                    res[order[row], col] = row + 1
                    prevVal = curVal
                    prevRank = row + 1

    return res
df = pd.DataFrame(np.random.uniform(0, 50, size=(100000, 5000)), columns=list(range(0, 5000)))
%%time
ranking = pd.DataFrame(range(1, 100000 + 1), columns=['index'])
ranking = pd.concat([ranking, pd.DataFrame(fastRanks(df[range(0, 5000)].to_numpy()))],
                    axis=1)

This ends up taking about 24 seconds to run.

Does anyone have any suggestions on how to speed this up?

Scott Woodhall
  • Did you try pandas built-in [DataFrame.rank](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rank.html) which may interface with numpy and even C-level calls? – Parfait Sep 28 '22 at 15:53
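For reference, the built-in approach suggested in this comment would look something like the following (reusing df from the question; ascending=False with method='min' is expected to match the descending, min-method ranking of fastRanks):

# Rank every column with pandas: the highest value gets rank 1 and
# tied values share the smallest rank, like the "min method" above.
ranking = df.rank(ascending=False, method='min').astype(np.int32)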

1 Answer


The first main issue is that the dataframe of the benchmark is stored with a row-major ordering while your algorithm needs a column-major ordering to be fast. Indeed, NumPy uses row-major ordering by default. Pandas generally generates column-major arrays internally, but it does not change the ordering when a dataframe is built from a NumPy array. The thing is that accessing data non-contiguously is much slower (mainly because of the CPU caches). Thus, the first thing to do is to generate a column-major dataframe. You can easily do that from a NumPy array with your_array.T.copy().T, and you can check the resulting layout with your_array.strides. Note that transposing the data is itself slow (you can speed it up with Numba though).
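A minimal sketch of this layout fix (the array here is illustrative, not the question's exact benchmark):

import numpy as np

a = np.random.uniform(0, 50, size=(100000, 5000))  # C order (row-major) by default
print(a.strides)    # (40000, 8): stepping down a column jumps 40000 bytes

# Make each column contiguous by switching to Fortran order (column-major)
a_colmajor = a.T.copy().T   # equivalent to np.asfortranarray(a)
print(a_colmajor.strides)   # (8, 800000): stepping down a column moves 8 bytes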

Additionally, df[range(0, 5000)].to_numpy() is not efficient since it creates a new temporary dataframe. You can just do df.to_numpy()[:, 0:5000], which only creates a view. Keep in mind that copying data is slow (because of the limited memory bandwidth).
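For example, assuming all the columns of df are float64 as in the question:

a = df.to_numpy()      # no copy here: the data is one homogeneous float64 block
sub = a[:, 0:5000]     # basic slicing returns a view, so no data is copied
assert sub.base is not None   # a view keeps a reference to its parent array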

Moreover, pd.concat is slow since it also creates a new temporary dataframe. This can generally be avoided by creating the dataframe and setting its columns manually. That being said, if the columns do not all have the same type, using Numba can be a bit tricky. One solution is to create a Numba typed list of NumPy array views referencing the (empty) dataframe and to fill the data from Numba through this list.
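A minimal sketch of the simple case where all ranked columns share one dtype (the typed-list trick for mixed dtypes is more involved and not shown):

res = fastRanks(df.to_numpy())   # int32 ranks, computed once
ranking = pd.DataFrame(res)      # wraps the array (no copy by default for a NumPy input)
ranking.insert(0, 'index', np.arange(1, len(df) + 1))  # add the column in place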

Finally, operations like -df[:, col] or np.argsort create temporary arrays, making the computation more memory-bound. The argsort also tends to perform random memory accesses, so it can be memory-bound as well (especially if one column does not fit in the L2 cache, or if the N columns computed in parallel do not fit in the L3 cache). The problem is that memory-bound code generally does not scale well. You can try to implement an optimized in-place argsort to fix this, but that is far from easy. You can also operate directly on the input (in-place computation) using a basic loop so as to avoid creating new arrays (newly created arrays tend to be slow to fill because of page faults and other low-level effects). Note that the best speedup should come from a faster sorting algorithm: a radix sort should give better performance, assuming you do not need the sort to be stable (the current sort should not be stable anyway).
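As an illustration of removing one temporary (not the radix-sort rewrite), here is a sketch that reverses an ascending argsort instead of negating the column first; the argsort output is still allocated, so this only removes the -df[:, col] copy, and it assumes a column-major input as discussed above:

import numpy as np
import numba as nb

@nb.njit('int32[:,:](float64[:,:])', parallel=True)
def fastRanksFewerTemps(a):
    n, m = a.shape
    res = np.empty((n, m), dtype=np.int32)
    for col in nb.prange(m):
        # Reversing an ascending argsort replaces np.argsort(-a[:, col]):
        # equal values stay adjacent, so the min-method ranks are identical.
        order = np.argsort(a[:, col])[::-1]
        if n > 0:
            prevVal = a[order[0], col]
            prevRank = 1
            res[order[0], col] = 1
            for row in range(1, n):
                curVal = a[order[row], col]
                if curVal == prevVal:
                    res[order[row], col] = prevRank
                else:
                    prevRank = row + 1
                    res[order[row], col] = prevRank
                    prevVal = curVal
    return res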

Jérôme Richard
  • So essentially transpose the array and change the ranking function to do "row" based ranking as opposed to column based? – Scott Woodhall Sep 29 '22 at 16:20
  • The transposition is a dirty fix that will not improve performance much and may even slow down the program. This is fine if you do not include this part in the benchmark. Otherwise, you need to generate the data directly in the right layout. In practice, pandas tends to generate it correctly except when the dataframe is built from a NumPy array, as you do. The two other main points for performance are avoiding copies and optimizing the sort (the latter being the most important). – Jérôme Richard Sep 30 '22 at 16:54