
np.corrcoef takes two arguments and they must have the same dimensions. In my case datax is an n by n array and datay is an n by 1 array. I want to vectorize this operation so I don't have to use loops to get my results. I think np.vectorize is the answer, but nothing I have tried produces a result. Here is my latest attempt:

def f(datax, datay):
    return np.corrcoef(data,datay)

result = np.vectorize(f, dtype=np.ndarray)
  • The question is missing important information about the alignment of *datax*. Supposing you want to broadcast *datay* like `np.corrcoef(datax, datay.T)[:,n]`, `vectorize` could be written as `np.vectorize(lambda x,y: np.corrcoef(x, y, rowvar=False), excluded='y', signature='(n),(m,1)->(k,k)')(datax, datay)`. Please clarify the question. – Michael Szczesny Sep 18 '22 at 21:40
  • 1
    What exactly is the problem? Errors? Wrong result? And read ALL of the docs before trying anything else. (Using `dtype` instead of `otypes` tells me you didn't read anything!) – hpaulj Sep 18 '22 at 21:44

1 Answer


np.vectorize() is not really for performance; it is essentially a convenience wrapper around a Python-level loop. Most numpy operations are vectorized anyway.
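A minimal sketch (with a toy function, not the question's data) illustrating that np.vectorize merely broadcasts a plain Python call element by element, producing the same result as an explicit loop:

```python
import numpy as np

# np.vectorize wraps a Python-level loop around the function;
# it provides broadcasting convenience, not speed.
square = np.vectorize(lambda x: x * x)

a = np.arange(5)
assert np.array_equal(square(a), [x * x for x in a])  # same as a manual loop
```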

I assume you're trying to calculate the correlation of each column of X with y.

Let's test it out (I used a small dataframe of roughly 400 rows); a naive for loop is indeed relatively slow:

%%timeit
[np.corrcoef(X_train[:,i], y_train)[0,1] for i in range(10)]
459 µs ± 1.45 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

A 'proper' vectorized version should do something like:

def f(datax, datay):
    return np.corrcoef(datax, datay, rowvar=False)

result = np.vectorize(f, signature="(m,n),(m)->(k,k)")

%%timeit
result(X_train, y_train)[-1,0:X_train[0].size]
121 µs ± 84.4 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

Much better! But alas, np.corrcoef() is already vectorized and handles this case directly, with no wrapper needed:

%%timeit
np.corrcoef(X_train, y_train, rowvar=False)[-1,0:X_train[0].size]
64.7 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

That's basically twice as fast.
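It's worth a quick sanity check that the one-shot call agrees with the per-column loop. Here X and y are random stand-ins for X_train/y_train:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 10))  # stand-in for X_train
y = rng.standard_normal(400)        # stand-in for y_train

# Loop: correlation of each column of X with y.
loop = np.array([np.corrcoef(X[:, i], y)[0, 1] for i in range(X.shape[1])])

# One shot: y is appended as the last variable, so take the last row
# of the (11 x 11) correlation matrix, first 10 entries.
oneshot = np.corrcoef(X, y, rowvar=False)[-1, :X.shape[1]]

assert np.allclose(loop, oneshot)
```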

If you really wish to speed it up, however, einsum comes to mind (adapted from this question):

def columnwisecorrcoef(O, P):
    # Correlation of each column of O with the vector P.
    n = np.double(P.size)
    DO = O - (np.einsum('ij->j', O) / n)   # center each column of O
    PO = P - (np.einsum('i->', P) / n)     # center P
    tmp = np.einsum('ij,ij->j', DO, DO)    # per-column sum of squares
    tmp *= np.einsum('i,i->', PO, PO)      # times sum of squares of P
    return np.dot(PO, DO) / np.sqrt(tmp)

%%timeit
columnwisecorrcoef(X_train, y_train)
24.8 µs ± 45.1 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
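As a final sanity check, the einsum version should agree with np.corrcoef on random data (a self-contained sketch that repeats the function definition from above):

```python
import numpy as np

def columnwisecorrcoef(O, P):
    # Correlation of each column of O with the vector P, via einsum.
    n = np.double(P.size)
    DO = O - (np.einsum('ij->j', O) / n)
    PO = P - (np.einsum('i->', P) / n)
    tmp = np.einsum('ij,ij->j', DO, DO)
    tmp *= np.einsum('i,i->', PO, PO)
    return np.dot(PO, DO) / np.sqrt(tmp)

rng = np.random.default_rng(1)
X = rng.standard_normal((400, 10))
y = rng.standard_normal(400)

expected = np.corrcoef(X, y, rowvar=False)[-1, :X.shape[1]]
assert np.allclose(columnwisecorrcoef(X, y), expected)
```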