Correlation between columns in DataFrame

Question

I'm pretty new to pandas, so I guess I'm doing something wrong.

I have a DataFrame:

     a     b
0  0.5  0.75
1  0.5  0.75
2  0.5  0.75
3  0.5  0.75
4  0.5  0.75

df.corr() gives me:

    a   b
a NaN NaN
b NaN NaN

but np.correlate(df["a"], df["b"]) gives: 1.875

Why is that? I want to have the correlation matrix for my DataFrame and thought that corr() does that (at least according to the documentation). Why does it return NaN?

What's the correct way to calculate it?

Many thanks!

unutbu · Accepted Answer · 2013-04-07T13:45:54.023

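np.correlate calculates the (unnormalized) cross-correlation between two 1-dimensional sequences:

c[k] = sum_n a[n+k] * conj(v[n])

In the default mode ('valid') with two inputs of equal length, only the k = 0 term is returned, so here the result is just the sum of products: 5 * (0.5 * 0.75) = 1.875.

df.corr, by contrast, calculates (by default) the Pearson correlation coefficient.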

The correlation coefficient (if it exists) is always between -1 and 1 inclusive. The cross-correlation is not bounded.

The formulas are related, but notice that the cross-correlation formula (above) involves no subtraction of the means and no division by the standard deviations, both of which are part of the formula for the Pearson correlation coefficient.
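To make the difference concrete, here is a minimal sketch that computes both quantities by hand. The arrays x and y are made-up illustration data, not taken from the question.

```python
import numpy as np

# Made-up illustration data (an assumption, not from the question).
x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 3.0, 9.0, 11.0])

# Unnormalized cross-correlation at zero lag: just the sum of products.
cross = np.correlate(x, y)[0]

# Pearson: subtract the means first, then divide by the standard deviations.
xc = x - x.mean()
yc = y - y.mean()
pearson = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))

print(cross)    # unbounded; grows with the scale and length of the data
print(pearson)  # always within [-1, 1]
```

The hand-computed value should agree with np.corrcoef(x, y)[0, 1].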

The standard deviations of df['a'] and df['b'] are both zero, so the Pearson formula divides zero by zero, which is why df.corr is NaN everywhere.
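A short sketch reproducing the question's situation, with the DataFrame rebuilt from the values shown in the question:

```python
import numpy as np
import pandas as pd

# The constant columns from the question.
df = pd.DataFrame({"a": [0.5] * 5, "b": [0.75] * 5})

print(df.std())   # both columns have zero standard deviation
print(df.corr())  # every entry is NaN

# np.correlate only sums the products: 5 * (0.5 * 0.75) = 1.875
print(np.correlate(df["a"], df["b"]))
```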

From the comment below, it sounds like you are looking for Beta. It is related to Pearson's correlation coefficient, but instead of dividing by the product of standard deviations:

corr(a, b) = cov(a, b) / (std(a) * std(b))

you divide by a variance:

beta = cov(a, b) / var(a)

You can compute Beta using np.cov:

cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]
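One way to sanity-check this ratio: cov(a, b) / var(a) is also the slope of the least-squares line fitted to b as a function of a. The sketch below assumes synthetic data in which b moves about twice as much as a. The seed, sample size, and noise level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
a = rng.normal(size=1000)
b = 2.0 * a + rng.normal(scale=0.1, size=1000)  # b is about 2 * a plus small noise

cov = np.cov(a, b)
beta = cov[1, 0] / cov[0, 0]

# Least-squares fit of the line b = slope * a + intercept.
slope, intercept = np.polyfit(a, b, 1)

print(beta, slope)  # the two values should agree, and both should be close to 2
```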

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(100)


def geometric_brownian_motion(T=1, N=100, mu=0.1, sigma=0.01, S0=20):
    """
    http://stackoverflow.com/a/13203189/190597 (unutbu)
    """
    dt = float(T) / N
    t = np.linspace(0, T, N)
    W = np.random.standard_normal(size=N)
    W = np.cumsum(W) * np.sqrt(dt)  # standard Brownian motion
    X = (mu - 0.5 * sigma ** 2) * t + sigma * W
    S = S0 * np.exp(X)  # geometric Brownian motion
    return S

N = 10 ** 6
a = geometric_brownian_motion(T=1, mu=0.1, sigma=0.01, N=N)
b = geometric_brownian_motion(T=1, mu=0.2, sigma=0.01, N=N)

cov = np.cov(a, b)
print(cov)
# [[ 0.38234755  0.80525967]
#  [ 0.80525967  1.73517501]]
beta = cov[1, 0] / cov[0, 0]
print(beta)
# 2.10609347015

plt.plot(a)
plt.plot(b)
plt.show()


The ratio of mus is 2, and beta is ~2.1.

You could also compute it with df.corr, though this is a much more roundabout way of doing it (but it is nice to see that the results are consistent):

import pandas as pd
df = pd.DataFrame({'a': a, 'b': b})
beta2 = (df.corr() * df['b'].std() * df['a'].std() / df['a'].var()).iloc[0, 1]
print(beta2)
# 2.10609347015
assert np.allclose(beta, beta2)
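If the data already lives in a DataFrame, a more direct route is possible with Series.cov and Series.var, skipping the correlation matrix entirely. This is a sketch on synthetic data (the seed and noise level are assumptions), not the method used in the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)  # arbitrary seed
a = rng.normal(size=500)
b = 2.0 * a + rng.normal(scale=0.1, size=500)
df = pd.DataFrame({"a": a, "b": b})

# Direct: covariance of the two columns divided by the variance of 'a'.
beta_cov = df["a"].cov(df["b"]) / df["a"].var()

# Via correlation: rescale Pearson's r by the ratio of standard deviations.
beta_corr = df["a"].corr(df["b"]) * df["b"].std() / df["a"].std()

print(beta_cov, beta_corr)
```

Both expressions reduce to cov(a, b) / var(a), so they should match up to floating-point error.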

Thanks! So in case my "a" and "b" are daily changes of prices and I want to measure how "b" moves with respect to "a" (meaning: if every time "a" goes up by 1%, "b" goes up by 2%, I'd expect to see 2.0; if "b" is always -0.5%, I'd expect -0.5). I guess the 'cross-correlation' is what I want, right? - Zach Moshe, Apr 07 '13 at 06:30
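The two cases in this comment can be checked with a small sketch. The daily changes below are hypothetical numbers, and beta is a helper defined here for illustration. Note that if "always -0.5%" is read literally as a constant series, its covariance with "a" is zero, so Beta comes out as roughly 0 rather than -0.5.

```python
import numpy as np

# Hypothetical daily changes of "a" (made up for illustration).
a = np.array([0.01, -0.02, 0.015, 0.03, -0.005])


def beta(x, y):
    """Illustrative helper: cov(x, y) / var(x)."""
    cov = np.cov(x, y)
    return cov[1, 0] / cov[0, 0]


b_double = 2.0 * a                  # "b" always moves twice as much as "a"
b_const = np.full_like(a, -0.005)   # "b" is -0.5% every day, regardless of "a"

print(beta(a, b_double))  # approximately 2
print(beta(a, b_const))   # approximately 0: a constant series does not co-vary with "a"
```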
