I have a DataFrame with a list of arrays as one column.

  import pandas as pd

  v = [1, 2, 3, 4, 5, 6, 7]
  v1 = [1, 0, 0, 0, 0, 0, 0]
  v2 = [0, 1, 0, 0, 1, 0, 0]
  v3 = [1, 1, 0, 0, 0, 0, 1]

  df = pd.DataFrame({'A': [v1, v2, v3]})

  print(df)

Output:

                       A
0  [1, 0, 0, 0, 0, 0, 0]
1  [0, 1, 0, 0, 1, 0, 0]
2  [1, 1, 0, 0, 0, 0, 1]

I want to compute pd.Series.corr between each row of df.A and the single vector v. I'm currently doing this by looping over df.A, which is very slow.
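
For reference, the loop is roughly along these lines. This is a minimal sketch rather than the exact code, and the name `target` is only illustrative:

```python
import pandas as pd

v = [1, 2, 3, 4, 5, 6, 7]
v1 = [1, 0, 0, 0, 0, 0, 0]
v2 = [0, 1, 0, 0, 1, 0, 0]
v3 = [1, 1, 0, 0, 0, 0, 1]

df = pd.DataFrame({'A': [v1, v2, v3]})

# Wrap the reference vector once so every row is correlated against it
target = pd.Series(v)

# One pd.Series.corr call per row: simple, but slow when there are many rows
df['B'] = [pd.Series(row).corr(target) for row in df['A']]

print(df)
```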

Expected Output:

                       A         B
0  [1, 0, 0, 0, 0, 0, 0]  -0.612372 
1  [0, 1, 0, 0, 1, 0, 0]  -0.158114
2  [1, 1, 0, 0, 0, 0, 1]  -0.288675 
Divakar
revendar

2 Answers

Here's one using the correlation definition with NumPy tools meant for performance, via corr2_coeff_rowwise -

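import numpy as np

# Stack the per-row lists into a 2D array of shape (n_rows, n_values)
a = np.array(df.A.tolist())  # or np.vstack(df.A.values)
# v becomes a single-row 2D array so it broadcasts against every row of a
df['B'] = corr2_coeff_rowwise(a, np.asarray(v)[None])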
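
corr2_coeff_rowwise is not defined in this post. A minimal sketch of such a helper is given here, assuming it computes the Pearson correlation row by row straight from the definition. It is an illustrative reconstruction and not necessarily the exact function referred to above:

```python
import numpy as np
import pandas as pd

def corr2_coeff_rowwise(A, B):
    # Assumed helper: row-wise Pearson correlation between 2D arrays A and B.
    # B may have a single row, in which case it broadcasts against all rows of A.
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)

    # Subtract each row's mean from that row
    A_mA = A - A.mean(axis=1, keepdims=True)
    B_mB = B - B.mean(axis=1, keepdims=True)

    # Sum of squared deviations per row
    ssA = (A_mA ** 2).sum(axis=1)
    ssB = (B_mB ** 2).sum(axis=1)

    # Covariance term divided by the product of the deviation norms
    return (A_mA * B_mB).sum(axis=1) / np.sqrt(ssA * ssB)

v = [1, 2, 3, 4, 5, 6, 7]
df = pd.DataFrame({'A': [[1, 0, 0, 0, 0, 0, 0],
                         [0, 1, 0, 0, 1, 0, 0],
                         [1, 1, 0, 0, 0, 0, 1]]})

a = np.array(df.A.tolist())
df['B'] = corr2_coeff_rowwise(a, np.asarray(v)[None])
print(df)
```

With the sample data from the question this reproduces the expected column B. A row with zero variance (all values equal) leads to a division by zero and a NaN result, so such rows may need separate handling.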

Runtime test -

Case #1 : 1000 rows

In [59]: df = pd.DataFrame({'A': [np.random.randint(0,9,(7)) for i in range(1000)]})

In [60]: v = np.random.randint(0,9,(7)).tolist()

# @jezrael's soln
In [61]: %timeit df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
10 loops, best of 3: 142 ms per loop

In [62]: %timeit df['B'] = corr2_coeff_rowwise(np.array(df.A.tolist()), np.asarray(v)[None])
1000 loops, best of 3: 461 µs per loop

Case #2 : 10000 rows

In [63]: df = pd.DataFrame({'A': [np.random.randint(0,9,(7)) for i in range(10000)]})

In [64]: v = np.random.randint(0,9,(7)).tolist()

# @jezrael's soln
In [65]: %timeit df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
1 loop, best of 3: 1.38 s per loop

In [66]: %timeit df['B'] = corr2_coeff_rowwise(np.array(df.A.tolist()), np.asarray(v)[None])
100 loops, best of 3: 3.05 ms per loop
Divakar

Use corrwith; if performance matters, though, Divakar's answer should be faster:

df['new'] = pd.DataFrame(df['A'].values.tolist()).corrwith(pd.Series(v), axis=1)
print(df)
                       A       new
0  [1, 0, 0, 0, 0, 0, 0] -0.612372
1  [0, 1, 0, 0, 1, 0, 0] -0.158114
2  [1, 1, 0, 0, 0, 0, 1] -0.288675
jezrael