
I need to independently shuffle each row of large pandas DataFrames several times (the typical shape is (10000,1000)) and then estimate the correlation of each row with a given series.

The most efficient (i.e. fastest) way I found to do it while staying within pandas is the following:

import numpy

for i in range(N):  # the larger N is, the better it is
    df_sh = df.apply(numpy.random.permutation, axis=1)
    # where df is my large DataFrame, with 10K rows and 1K columns

    corr = df_sh.corrwith(s, axis=1)
    # where s is the provided series (shape of s = (1000,))

The two tasks take approximately the same amount of time (namely 30 secs each). I tried converting my DataFrame to a NumPy array, looping over the array and, for each row, first performing the permutation and then measuring the correlation with scipy.stats.pearsonr. Unfortunately, I only managed to speed up my two tasks by a factor of 2. Are there other viable options to speed up the tasks even more? (NB: I am already parallelizing the execution of my code with Joblib up to the maximum factor allowed by the machine I am using.)
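For reference, a minimal sketch of the NumPy/scipy.stats.pearsonr loop I described (apart from df, s and N, the variable names are just illustrative):

import numpy as np
from scipy.stats import pearsonr

a = df.values        # shape (10000, 1000)
s_ar = s.values      # shape (1000,)

for i in range(N):
    corr = np.empty(a.shape[0])
    for k, row in enumerate(a):
        # shuffle this row independently, then correlate it with s
        shuffled_row = np.random.permutation(row)
        corr[k], _ = pearsonr(shuffled_row, s_ar)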


1 Answer


Correlation between a 2D matrix/array and 1D array/vector:

We can adapt corr2_coeff_rowwise for correlation between a 2D array/matrix and a 1D array/vector, like so -

import numpy as np

def corr2_coeff_2d_1d(A, B):
    # Rowwise mean of input arrays & subtract from input arrays themselves
    A_mA = A - A.mean(1,keepdims=1)
    B_mB = B - B.mean()

    # Sum of squares across rows
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    ssB = B_mB.dot(B_mB)

    # Finally get corr coeff
    return A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
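As a quick sanity check, the output can be compared against pandas' corrwith on a tiny made-up input (a small example of mine, just to illustrate) -

import numpy as np
import pandas as pd

A = np.random.rand(5, 8)
B = np.random.rand(8)

out = corr2_coeff_2d_1d(A, B)
ref = pd.DataFrame(A).corrwith(pd.Series(B), axis=1).values
print(np.allclose(out, ref))   # expected: True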

To shuffle each row independently, np.random.shuffle alone won't do: it works along the first axis, so feeding in the transposed array would apply one and the same column permutation to every row. Instead, we can draw a random key for every element and argsort the keys along each row - the `rand+argsort/argpartition` trick - which yields an independent permutation per row that we then apply with NumPy's advanced indexing. Also, since advanced indexing returns a copy, the original dataframe stays untouched. A small illustration of the difference is shown below.

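As a quick illustration of the difference on a tiny made-up array (`toy` is just an example name) -

import numpy as np

toy = np.arange(12).reshape(3, 4)

# Shuffling the transpose reorders whole columns in-place,
# so every row ends up with the SAME column order.
same_perm = toy.copy()
np.random.shuffle(same_perm.T)

# rand+argsort trick: independent random keys per element, argsorted
# along each row, give a different permutation for every row.
idx = np.random.rand(*toy.shape).argsort(1)
indep_perm = toy[np.arange(toy.shape[0])[:, None], idx]

print(same_perm)   # rows share one column permutation
print(indep_perm)  # each row is permuted independently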
Hence, let's use this to solve our case -

# Extract underlying array data for faster NumPy processing in the loop later on
a = df.values  
s_ar = s.values

# Setup array for row-indexing with NumPy's advanced indexing later on
r = np.arange(a.shape[0])[:,None]

for i in range(N):
    # Get shuffled indices per row with `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    corr = corr2_coeff_2d_1d(shuffled_a, s_ar)

Optimized version #1

Now, we could pre-compute the parts involving the series, which stays the same between iterations. The row means can also be computed just once from the original array, since shuffling a row does not change its mean. Hence, a further optimized version would look like this -

a = df.values  
s_ar = s.values
r = np.arange(a.shape[0])[:,None]

B = s_ar
B_mB = B - B.mean()
ssB = B_mB.dot(B_mB)

A = a
A_mean = A.mean(1,keepdims=1)

for i in range(N):
    # Get shuffled indices per row with `rand+argsort/argpartition` trick from -
    # https://stackoverflow.com/a/45438143/
    idx = np.random.rand(*a.shape).argsort(1)

    # Shuffle array data with NumPy's advanced indexing
    shuffled_a = a[r, idx]

    # Compute correlation
    A = shuffled_a
    A_mA = A - A_mean
    ssA = np.einsum('ij,ij->i',A_mA,A_mA)
    corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
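As a quick sanity check, the correlations from the last iteration can be compared against scipy.stats.pearsonr applied row by row (assuming `shuffled_a`, `corr` and `s_ar` from the loop above are still around) -

from scipy.stats import pearsonr

ref = np.array([pearsonr(row, s_ar)[0] for row in shuffled_a])
print(np.allclose(corr, ref))   # expected: True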

Benchmarking

Set up inputs with actual use-case shapes/sizes -

In [302]: df = pd.DataFrame(np.random.rand(10000,1000))

In [303]: s = pd.Series(df.iloc[0])

1. Original method

In [304]: %%timeit
     ...: df_sh = df.apply(np.random.permutation, axis=1)
     ...: corr = df_sh.corrwith(s, axis = 1)
1 loop, best of 3: 1.99 s per loop

2. Proposed method

The pre-processing part (done only once before starting the loop, hence not included in the timings) -

In [305]: a = df.values  
     ...: s_ar = s.values
     ...: r = np.arange(a.shape[0])[:,None]
     ...: 
     ...: B = s_ar
     ...: B_mB = B - B.mean()
     ...: ssB = B_mB.dot(B_mB)
     ...: 
     ...: A = a
     ...: A_mean = A.mean(1,keepdims=1)

Part of the proposed solution that runs in the loop -

In [306]: %%timeit
     ...: idx = np.random.rand(*a.shape).argsort(1)
     ...: shuffled_a = a[r, idx]
     ...: 
     ...: A = shuffled_a
     ...: A_mA = A - A_mean
     ...: ssA = np.einsum('ij,ij->i',A_mA,A_mA)
     ...: corr = A_mA.dot(B_mB)/np.sqrt(ssA*ssB)
1 loop, best of 3: 675 ms per loop

Thus, we are seeing a speedup of around 3x here!

  • I'll test it tomorrow and let you know, but it sounds very promising, I had not thought of randomizing the index of the matrix rather than the entries. – SirC Dec 14 '17 at 15:50
  • At the beginning I went for `np.random.shuffle(a.T)` as well, but unfortunately I realized that it wasn't randomizing each line independently, i.e. it does `[[1,2,3],[4,5,6]] -> [[1,3,2],[4,6,5]]`: if the i-th entry of line k is moved to position j, then the i-th entry of line h is moved to the same position. That is why I was looping over `a` (using your notation); I can't find a way to avoid performing such a loop. – SirC Dec 15 '17 at 16:21
  • and that's why I liked your original answer which was randomizing indexes. – SirC Dec 15 '17 at 16:22