
I work with large data sheets in which I am trying to correlate all of the columns.

I achieve this using:

df = df.rolling(5).corr(pairwise=True)

This produces data like this:

477 

s1  -0.240339   0.932141    1.000000    0.577741    0.718307    -0.518748   0.772099 
s2  0.534848    0.626280    0.577741    1.000000    0.645064    -0.455503   0.447589 
s3  0.384720    0.907782    0.718307    0.645064    1.000000    -0.831378   0.406054
s4  -0.347547   -0.651557   -0.518748   -0.455503   -0.831378   1.000000    -0.569301 
s5  -0.315022   0.576705    0.772099    0.447589    0.406054    -0.569301   1.000000 

for each row contained in the data set, where 477 is the row number (index) and s1 - s5 are the column titles.

The goal is to find when the sensors are highly correlated with each other. I want to achieve this by (a) calculating the correlation over a rolling window of 5 rows using the code above, and (b) for each row produced (i.e. i = 0 to i = 500 for a 500-row Excel sheet), summing the table that dataframe.rolling(5).corr() produces for that value of i. In other words, produce one value per unit time, as in the graph included at the bottom. I am new to Stack Overflow, so please let me know if there's more information I can provide.

Example code + data:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

d = {'col1': [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
              -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
     'col2': [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
              -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
     'col3': [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
              -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}

df = pd.DataFrame(data=d)

dfn = df.rolling(5).corr(pairwise=True)

MATLAB code which accomplishes what I want:

% move through the data and get a correlation for 5 data points
for i = 1:ns-4
    C(:,:,i) = corrcoef(X(i:i+4,:));
    cact(i) = sum(C(:,:,i), 'all') - nv; % subtracting nv removes the diagonals that are = 1 and don't change
end
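For comparison, here is a direct translation of that MATLAB loop into plain NumPy, as a sketch using the example data above (the names `X`, `ns`, `nv`, and `cact` mirror the MATLAB variables):

```python
import numpy as np

# example data from the question
d = {'col1': [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
              -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
     'col2': [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
              -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
     'col3': [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
              -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}

X = np.column_stack(list(d.values()))   # rows = time, columns = sensors
ns, nv = X.shape                        # number of rows, number of sensors

cact = []
for i in range(ns - 4):                 # MATLAB's for i=1:ns-4
    # 5-row window, columns treated as variables (like corrcoef on X(i:i+4,:))
    C = np.corrcoef(X[i:i + 5, :], rowvar=False)
    cact.append(C.sum() - nv)           # subtract the diagonal of 1s
```

This gives one summed-correlation value per 5-row window, which is what the question asks to plot against time.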

For the original data, the following is the graph I am trying to produce in Python, where the x axis is time: [Correlation Graph]

ultruction
    Welcome to Stack Overflow! Please include a _small_ subset of your data as a __copyable__ piece of code that can be used for testing as well as your expected output for the __provided__ data. See [MRE - Minimal, Reproducible, Example](https://stackoverflow.com/help/minimal-reproducible-example), and [How to make good reproducible pandas examples](https://stackoverflow.com/q/20109391/15497888) for more information. – Henry Ecker May 30 '21 at 16:33
  • several questions to this: (1) what does large data sheet mean for you in this context? (2) When applying `.rolling()` this way, you would - theoretically - apply the correlation function calculation only to 5 rows at a time. Is this intended? (if you would want to do this by column, you would need the parameter `axis = 1`) (3) is your dataframe nested? (4) what do you try to achieve with this output since `pandas.DataFrame.corr()` should do the whole dataframe pairwise already? – Racooneer May 30 '21 at 16:35
  • `sum the table of correlation for each integer index` - could you explain that? You want to perform that operation on each individual *index*, 477? you might want to include minimal example of the original Dataframe and the expected result of the operation. – wwii May 30 '21 at 16:35
  • Large dataframe ---> excel table with 55 columns, and about 500 rows. I am trying to figure out how to post a snippet of the data I am working with into my question. I will edit my question to reflect what you guys have commented, thank you! – ultruction May 30 '21 at 16:37
  • [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). You don't need all 55 columns or 500 rows. Just enough to illustrate what you are trying to do. It can even be fake data with random values. – wwii May 30 '21 at 16:39
  • Please include any relevant information [as text directly into your question](https://stackoverflow.com/editing-help), do not link or embed external images of source code or data. Images make it difficult to efficiently assist you as they cannot be copied and offer poor usability as they cannot be searched. See: [Why not upload images of code/errors when asking a question?](https://meta.stackoverflow.com/q/285551/15497888) – Henry Ecker May 30 '21 at 16:47
  • Please do not expect us to read and transcribe that 25 x 5 code snippet. – Henry Ecker May 30 '21 at 16:48
  • I edited my post, please let me know if that is sufficient. Thanks for your help! Sorry I'm new. – ultruction May 30 '21 at 17:05
  • This is _much_ better. Last thing is, what do you _want_ to get that your current code is not giving you. "sum the tables" how? Row-wise, col-wise, both directions? – Henry Ecker May 30 '21 at 17:08
  • both directions. picture a table produced by dataframe.corr(). Basically sum the entire table in both directions, and subtract the diagonal of 1's which is the sensors correlated with themselves. With the rolling correlation I'm doing, I want to do that for each window of 5, so that I can graph it and see when they're correlated. I'll include above MATLAB code which accomplishes the same but I can't figure out how to do in Python. Let me know if that makes sense. Basically I want to change it from the current output, to ONE output per unit time to plot it. – ultruction May 30 '21 at 17:12
  • So after *summing* the correlation table for each *row* the result is a single number? Not one number for each sensor? – wwii May 31 '21 at 13:52
  • correct. I would expect a single number for each row. This is the crux of the problem, I am capable of getting one number for each sensor by collapsing say a 3x3 into a 3x1 per unit time, however I cannot get the 3x3 into 1 value. Please let me know if that makes sense. – ultruction May 31 '21 at 13:55
  • So for your first example, row 470, the result is 4.791859. – wwii May 31 '21 at 14:00
  • I'm matlab ignorant - in **that** code is `nv` the number of sensors/columns? – wwii May 31 '21 at 14:02
  • Yes! row 470 would be equal to 4.79. exactly. – ultruction May 31 '21 at 14:10
  • sorry I'm also MATLAB ignorant, that is my research supervisors code. but yes nv would be equal to the number of sensors/columns. the 'all' function seen within the sum function in the MATLAB code conveniently sums entire matrices which he produces one at a time using that for loop. – ultruction May 31 '21 at 14:13
  • @ultruction - I added a Pandas only solution to the end of my answer, if you would prefer that. – wwii May 31 '21 at 15:43

1 Answer


sum the entire table in both directions, and subtract the diagonal of 1's which is the sensors correlated with themselves.

Using your dfn, row four is:

>>> dfn.loc[4]   
          col1      col2      col3
col1  1.000000 -0.146977 -0.227059
col2 -0.146977  1.000000  0.435216
col3 -0.227059  0.435216  1.000000

You can sum the complete table using NumPy's ndarray.sum() on the underlying data:

>>> dfn.loc[4].to_numpy().sum()
3.1223603416753103

Then, assuming the correlation table is square, you just need to subtract the number of columns/sensors. If there isn't already a variable holding that number, you can use the shape of the underlying NumPy array:

>>> v = dfn.loc[4].to_numpy()
>>> v.shape
(3, 3)
>>> v.sum() - v.shape[0]
0.12236034167531029

Without using the NumPy array, you could sum the correlation table twice before subtracting:

>>> four = dfn.loc[4] 
>>> four.sum().sum()
3.1223603416753103
>>> four.sum().sum() - four.shape[0]
0.12236034167531029

Get the NumPy array of the whole rolling correlation and reshape it to get a separate correlation matrix for each original row:

n_sensors = 3
v = dfn.to_numpy()  # v.shape = (30,3)
new_dims = df.shape[0], n_sensors, n_sensors
v = v.reshape(new_dims) # shape = (10,3,3)
print(v[4])

 [[ 1.         -0.14697697 -0.22705934]
 [-0.14697697  1.          0.43521648]
 [-0.22705934  0.43521648  1.        ]]

Sum across the last two dimensions and subtract the number of sensors:

result = v.sum((1,2)) - n_sensors
print(result)

[nan, nan, nan, nan, 0.12236034, 0.25316027, -2.40763192, -1.9370202, -2.28023618, -2.57886457]

There is probably a way to do that in Pandas but I'd have to work on that to figure it out. Maybe someone will answer with an all Pandas solution.


The rolling correlation DataFrame has a MultiIndex:

>>> dfn.index
MultiIndex([(0, 'col1'),
            (0, 'col2'),
            (0, 'col3'),
            (1, 'col1'),
            (1, 'col2'),
            (1, 'col3'),
            (2, 'col1'),
            (2, 'col2'),
            (2, 'col3'),
            ...

With a quick review of the MultiIndex docs and a search using `pandas multi index sum on level 0 site:stackoverflow.com`, I came up with: group by level 0 and sum, then sum again along the columns.

>>> four_five = dfn.loc[[4,5]]
>>> four_five
            col1      col2      col3
4 col1  1.000000 -0.146977 -0.227059
  col2 -0.146977  1.000000  0.435216
  col3 -0.227059  0.435216  1.000000
5 col1  1.000000  0.191238 -0.644203
  col2  0.191238  1.000000  0.579545
  col3 -0.644203  0.579545  1.000000
>>> four_five.groupby(level=0).sum()
       col1      col2      col3
4  0.625964  1.288240  1.208157
5  0.547035  1.770783  0.935343
>>> four_five.groupby(level=0).sum().sum(1)
4    3.12236
5    3.25316
dtype: float64

Then, for the complete DataFrame:

>>> dfn.groupby(level=0).sum().sum(1) - n_sensors
0   -3.000000
1   -3.000000
2   -3.000000
3   -3.000000
4    0.122360
5    0.253160
6   -2.407632
7   -1.937020
8   -2.280236
9   -2.578865
dtype: float64 

Reading a few more of the answers from that search (I should have looked at the DataFrame.sum docs more closely), the groupby can also be written as a level-aware sum. Note that `sum(level=0)` is deprecated in newer pandas versions in favor of `groupby(level=0).sum()`.

>>> dfn.sum(level=0).sum(1) - n_sensors       
0   -3.000000
1   -3.000000
2   -3.000000
3   -3.000000
4    0.122360
5    0.253160
6   -2.407632
7   -1.937020
8   -2.280236
9   -2.578865
dtype: float64
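To turn that Series into the one-value-per-unit-time graph the question asks for, it can be plotted directly; a minimal sketch using the example data from the question (the output file name `correlation.png` and the axis labels are just illustrative choices):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; drop this line and use plt.show() interactively
import matplotlib.pyplot as plt

d = {'col1': [-2587.944231, -1897.324231, -2510.304231, -2203.814231, -2105.734231,
              -2446.964231, -2963.904231, -2177.254231, 2796.354231, -2085.304231],
     'col2': [-3764.468462, -3723.608462, -3750.168462, -3694.998462, -3991.268462,
              -3972.878462, 3676.608462, -3827.808462, -3629.618462, -1841.758462],
     'col3': [-166.1357692, -35.36576923, 321.4157692, 108.9257692, -123.2257692,
              -10.84576923, -100.7457692, 89.27423077, -211.0857692, 101.5342308]}
df = pd.DataFrame(d)
n_sensors = df.shape[1]

dfn = df.rolling(5).corr(pairwise=True)
# one summed-correlation value per original row, as above
summed = dfn.groupby(level=0).sum().sum(axis=1) - n_sensors

ax = summed.plot()
ax.set_xlabel('time (row index)')
ax.set_ylabel('summed pairwise correlation')
plt.savefig('correlation.png')
```

Note that the first four rows come out as -3 rather than NaN because groupby's sum treats the all-NaN windows as 0; you could drop them with `summed.iloc[4:]` before plotting if you prefer.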
wwii