
I have a dataset that contains numeric values, and I'd like to measure the correlation between its columns.

Let's consider:

dataset = pd.DataFrame({'A':np.random.rand(100)*1000, 
                        'B':np.random.rand(100)*100,  
                        'C':np.random.rand(100)*10, 
                        't':np.random.rand(100)})

Mathematically, uncorrelated data means that cov(a, b) = 0. But with real data, the covariance will only be near zero.

np.cov(a,b)

This NumPy call gives us the covariance between the two. But I'd like to make sure that my dataset is not correlated; any trick to do that?
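For pairwise checks across all columns, `DataFrame.corr()` does this directly. A minimal sketch (the seeded random data below is my own illustration, not from the question):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dataset = pd.DataFrame({'A': rng.random(100) * 1000,
                        'B': rng.random(100) * 100,
                        'C': rng.random(100) * 10,
                        't': rng.random(100)})

# Pearson correlation between every pair of columns (values in [-1, 1]).
corr = dataset.corr()
print(corr)

# Off-diagonal entries near 0 indicate the columns are uncorrelated.
off_diag = corr.values[~np.eye(len(corr), dtype=bool)]
print(np.abs(off_diag).max())
```

Since the columns are drawn independently, every off-diagonal entry should be small, while the diagonal is exactly 1.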

UPDATE

from matplotlib.mlab import PCA  # note: deprecated and later removed from Matplotlib; use sklearn.decomposition.PCA in newer code
results = PCA(dataset.values)
user3378649
  • I am not sure that I understand your question, but in the last part you say that you want to make sure that your data is not correlated. If you apply Principal Component Analysis (PCA) to any dataset, the resulting principal components are uncorrelated by definition. – Akavall Apr 24 '14 at 14:35
  • @Akavall: Yes, I want to find out whether any two of the columns A, B, C, t are correlated. I have a huge dataset (20 columns * 10K), so I need to see if it's correlated. – user3378649 Apr 24 '14 at 15:00
  • @Akavall: I updated the post based on what you said; how can I interpret "results" in this case? – user3378649 Apr 24 '14 at 15:02
  • This is my favorite tutorial for PCA http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf, However I am not sure if this is a solution to your problem. – Akavall Apr 24 '14 at 15:06
  • Basically you have 4 matrices and you are looking for highly correlated pairs between them, can 4 matrices be transformed to 1 bigger matrix for simplicity? – Akavall Apr 24 '14 at 15:09
  • 10000 does not seem like a huge number; you should be able to create a correlation matrix, which will store 100,000,000 elements, and then find the correlated pairs. – Akavall Apr 24 '14 at 15:11
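For a 10K-by-20 dataset like the one described in the comments, the 20 × 20 correlation matrix is cheap to compute and can be scanned for large off-diagonal entries. An illustrative sketch (the planted correlated column and the 0.9 threshold are my own example choices):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((10000, 20))  # rows = samples, columns = variables
# plant one strongly correlated pair so there is something to find
data[:, 1] = data[:, 0] * 2 + rng.random(10000) * 0.01

# 20 x 20 correlation matrix; rowvar=False treats columns as variables
corr = np.corrcoef(data, rowvar=False)

# report column pairs whose |correlation| exceeds a threshold
threshold = 0.9
i_idx, j_idx = np.where(np.triu(np.abs(corr), k=1) > threshold)
pairs = list(zip(i_idx, j_idx))
print(pairs)
```

Only the planted pair (columns 0 and 1) should exceed the threshold; with 10,000 samples, spurious correlations between independent columns stay tiny.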

1 Answer


I have a covariance code snippet that I refer to:

    import numpy as np

    nsamples = matrix.shape[0]  # number of rows (samples)
    mean = np.mean(matrix, axis=0)
    # make a mean matrix the same shape as the data for subtraction
    mean_mat = np.outer(np.ones((nsamples, 1)), mean)

    centered = matrix - mean_mat
    cov = np.dot(centered.T, centered) / (nsamples - 1)

Here cov is the covariance matrix as a NumPy array, and mean is the column-wise mean (the mean taken down the rows).

Note the matrix doesn't need to be square.
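The snippet can be sanity-checked against `np.cov`. A self-contained sketch (the random `matrix` is my own test input; variable names follow the answer):

```python
import numpy as np

rng = np.random.default_rng(2)
matrix = rng.random((50, 3))  # 50 samples, 3 variables; not square
nsamples = matrix.shape[0]

mean = np.mean(matrix, axis=0)
# subtract the column means, then form the sample covariance
centered = matrix - np.outer(np.ones((nsamples, 1)), mean)
cov = np.dot(centered.T, centered) / (nsamples - 1)

# np.cov treats rows as variables by default, hence rowvar=False
assert np.allclose(cov, np.cov(matrix, rowvar=False))
```

Note that the result is always square (variables × variables) even though the input matrix is not.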

Then you can use the covariance matrix to "take out the variance" by multiplying the data by the inverse covariance, computed via the Moore-Penrose pseudoinverse:

    from numpy.linalg import svd

    # np.linalg.svd returns V already transposed, so cov = U @ diag(S) @ V
    U, S, V = svd(cov)
    D = np.diag(1. / S)
    # pseudoinverse: inv = V.T @ diag(1/S) @ U.T
    inverse_cov = np.dot(V.T, np.dot(D, U.T))
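The same pseudoinverse is available directly as `np.linalg.pinv`, which is a convenient cross-check. A self-contained sketch (it builds its own random covariance matrix for illustration):

```python
import numpy as np
from numpy.linalg import svd

rng = np.random.default_rng(3)
X = rng.random((100, 4))
cov = np.cov(X, rowvar=False)  # 4 x 4 covariance matrix

# pseudoinverse via SVD; numpy's svd returns V already transposed
U, S, V = svd(cov)
D = np.diag(1.0 / S)
inverse_cov = np.dot(V.T, np.dot(D, U.T))

# agrees with numpy's built-in Moore-Penrose pseudoinverse
assert np.allclose(inverse_cov, np.linalg.pinv(cov))
```

For a full-rank covariance matrix like this one, the pseudoinverse coincides with the ordinary inverse; the SVD route only differs when some singular values are (near) zero, where 1/S would need to be thresholded.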
Felix Castor
  • Thanks! Do you think it is better to use np.corrcoef(a, b) between all the vectors inside the matrix? I need to detect the "highly" correlated vectors inside the matrix. – user3378649 Apr 24 '14 at 17:47
  • The correlation matrix is the extension of the correlation coefficient to a multivariate system. If a and b are single samples from a multivariate system, finding the correlation coefficient wouldn't provide much information. – Felix Castor Apr 24 '14 at 18:50
  • Maybe normed Euclidean distance between vectors? This would tell you the vectors that are most alike. The smaller the distance the more similar they are. – Felix Castor Apr 25 '14 at 12:36
  • Yeah, actually this is what I've done! I ended up using both approaches: the correlation coefficient between every two vectors, and the Euclidean distance. – user3378649 Apr 25 '14 at 13:08
  • There is also [Mahalanobis Distance](http://en.wikipedia.org/wiki/Mahalanobis_distance). Are you working with a multivariate system? – Felix Castor Apr 25 '14 at 14:33
  • Yes, I have a multivariate system. What I've done is find all the 2-tuple combinations of vectors in the matrix; then I evaluate np.corrcoef(Ai, Aj). What is the best approach? – user3378649 Apr 25 '14 at 15:56
  • I would collect as many samples as possible (vectors) and use those to define a covariance matrix. Then use the covariance matrix and the Mahalanobis distance between each pair of vectors whose similarity you want to check. corrcoef() is more for a set of two single-variable systems over a number of samples, say X = {x1,x2,...,xn} and Y = {y1,y2,...,yn}, where xn and yn represent a similar measurement; xn may be the input, yn may be the output. What you have, I think, is H = {X1,X2,...,Xm} where X1 = {x1,x2,...,xn}. The correlation matrix is what you will need, so no, corrcoef doesn't make sense. – Felix Castor Apr 25 '14 at 17:07
  • In this case, if I use the correlation matrix, how can I detect the vectors that are "highly" correlated? And how should I calculate the correlation matrix: using http://stackoverflow.com/questions/3437513/finding-the-correlation-matrix, or just numpy.corrcoef(matrix)? – user3378649 Apr 25 '14 at 17:14
  • Ok I see now. That does make sense. – Felix Castor Apr 25 '14 at 18:58
  • @Felix Castor: So, what do you recommend? Did I do the right thing? – user3378649 Apr 25 '14 at 19:00