How to test correlation between two sets in Python?

Question

I have two different DataFrames. The first one is shown below:

df1=

      Datetime      BSL
0          7  127.504505
1          8  115.254132
2          9  108.994275
3         10  102.936860
4         11   99.830400
5         12  114.660522
6         13  138.215339
7         14  132.131075
8         15  121.478006
9         16  113.795645
10        17  114.038462

The second one is df2=

    Datetime       Number of Accident
0          7                  3455
1          8                 17388
2          9                 27767
3         10                 33622
4         11                 33474
5         12                 12670
6         13                 28137
7         14                 27141
8         15                 26515
9         16                 24849
10        17                 13013

The first one holds the blood sugar level (BSL) of people by hour of day (7 means between 7 am and 8 am). The second one holds the number of accidents during the same hours.

When I try this code:

df1.corr(df2, "pearson")

I get this error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

How can I solve it? Or, how can I test correlation between two different variables?

Which modules are you using? Include that in your question and add them as tags. — Hippolippo, Feb 12 '20 at 13:30
Check my answer here, I believe it is more or less what you are looking for. You should first merge the columns into a single dataframe: https://stackoverflow.com/questions/60116042/calculate-pearson-correlation-in-python/60116249#60116249 — Celius Stingher, Feb 12 '20 at 13:34
What type of correlation are you looking for? The whole series, or hourly? — Celius Stingher, Feb 12 '20 at 13:36
@CeliusStingher, I am looking for an hourly correlation. Actually, my first question is here: https://stackoverflow.com/questions/60182894/pandas-how-to-find-correlation-between-one-time-series-column-accident-times — Gokhan Kazar, Feb 12 '20 at 13:37
You cannot compute an hourly correlation because you have only 1 value per hour. It is mathematically impossible, so you'll get all NaNs. — Celius Stingher, Feb 12 '20 at 13:45
Hi, you should value the work of the people who have responded and accept one of the answers. Finally, ask your follow-up in a separate question, as you are doing :) — ansev, Feb 12 '20 at 14:17

score 3 · Accepted Answer · answered Feb 12 '20 at 13:44

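from scipy.stats import pearsonr

# df2's column is 'Number of Accident'; rename it so the merged frame has 'Accidents'
df_full = df1.merge(df2.rename(columns={'Number of Accident': 'Accidents'}), on='Datetime', how='left')
# pearsonr returns the correlation coefficient and the two-sided p-value
full_correlation = pearsonr(df_full['BSL'], df_full['Accidents'])
print('Correlation coefficient:', full_correlation[0])
print('P-value:', full_correlation[1])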

Output:

(-0.2934597230564072, 0.3811116115819819)
Correlation coefficient: -0.2934597230564072
P-value: 0.3811116115819819

Edit:

You want an hourly correlation, but that is mathematically impossible because you have only one (x, y) pair per hour. A single point has no spread, so the coefficient is undefined and the output is all NaNs. This is the code, although its output is not meaningful:

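# Select columns with a list ([[...]]); tuple-style selection after groupby fails in recent pandas
df_corr = df_full.groupby('Datetime')[['BSL','Accidents']].corr().drop(columns='BSL').drop('Accidents',level=1).rename(columns={'Accidents':'Correlation'})
print(df_corr)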

Output:

              Correlation
Datetime                 
7        BSL          NaN
8        BSL          NaN
9        BSL          NaN
10       BSL          NaN
11       BSL          NaN
12       BSL          NaN
13       BSL          NaN
14       BSL          NaN
15       BSL          NaN
16       BSL          NaN
17       BSL          NaN
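
To see why every group comes back as NaN, the sketch below (not part of the original answer) first correlates a single observation, then a small made-up table with several observations per hour. The `daily` frame, its `Day` column and all of its values are hypothetical, invented only to show the shape of data an hourly correlation would need.

```python
import warnings
import pandas as pd

# One (BSL, accidents) pair, as in the 7 am row of the question's data.
single_bsl = pd.Series([127.504505])
single_accidents = pd.Series([3455.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # NumPy warns about zero degrees of freedom
    r_single = single_bsl.corr(single_accidents)
print(r_single)  # NaN: one point has no spread to correlate

# Hypothetical data: three days of readings for each of two hours.
daily = pd.DataFrame({
    "Datetime": [7, 7, 7, 8, 8, 8],
    "Day": [1, 2, 3, 1, 2, 3],
    "BSL": [127.5, 120.1, 131.0, 115.3, 118.9, 110.2],
    "Accidents": [10, 14, 9, 40, 35, 47],
})

# With several rows per hour, a per-hour coefficient is defined.
hourly_r = pd.Series({
    hour: group["BSL"].corr(group["Accidents"])
    for hour, group in daily.groupby("Datetime")
})
print(hourly_r)
```

With only three points per hour these coefficients would still be very noisy; the point is only that they are no longer NaN.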

Thank you, but I am looking for an hour-by-hour correlation, based on this question that I asked before: https://stackoverflow.com/questions/60182894/pandas-how-to-find-correlation-between-one-time-series-column-accident-times Can you look at that question? This is my real target. — Gokhan Kazar, Feb 12 '20 at 13:48
You have that. You should fix your data, because as it is it doesn't represent what you are truly asking. I will try to address the matter in your other question. — Celius Stingher, Feb 12 '20 at 13:49

score 0 · Answer 2 · answered Feb 12 '20 at 13:40

Since your dataframes have more than one column, you need to specify the name of the column you want to use:

df1['BSL'].corr(df2['Number of Accident'], "pearson")
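
A self-contained version of this approach, rebuilt from the tables in the question, might look like the sketch below. One assumption worth stating: `Series.corr` pairs values by index label, so the first call only matches hours correctly because both frames share the same 0 to 10 index in the same hour order. The second call pairs rows by `Datetime` explicitly and should give the same number here.

```python
import pandas as pd

hours = list(range(7, 18))
df1 = pd.DataFrame({
    "Datetime": hours,
    "BSL": [127.504505, 115.254132, 108.994275, 102.936860, 99.830400,
            114.660522, 138.215339, 132.131075, 121.478006, 113.795645,
            114.038462],
})
df2 = pd.DataFrame({
    "Datetime": hours,
    "Number of Accident": [3455, 17388, 27767, 33622, 33474, 12670,
                           28137, 27141, 26515, 24849, 13013],
})

# Pairs rows by the default integer index (0..10).
r_by_index = df1["BSL"].corr(df2["Number of Accident"], method="pearson")

# Pairs rows by hour, which stays correct even if the row order differs.
r_by_hour = df1.set_index("Datetime")["BSL"].corr(
    df2.set_index("Datetime")["Number of Accident"]
)
print(r_by_index, r_by_hour)
```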

answered by ManojK

score 0 · Answer 3 · answered Feb 12 '20 at 13:40

The corr() method of a pandas dataframe calculates a correlation matrix for all columns in one dataframe. You have two dataframes, so that method won't work. You can solve this by doing:

df1['number'] = df2['Number of Accident']
df1.corr("pearson")
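
As a sketch of why the original call fails: the first positional parameter of `DataFrame.corr` is `method`, so `df1.corr(df2, "pearson")` passes `df2` where a method name is expected, and comparing a DataFrame with a string is what triggers the "truth value is ambiguous" error. The example below uses a shortened, three-row copy of the question's data, and it merges on `Datetime` instead of assigning the column by position. The merge is a variation on this answer, not the answer's own code.

```python
import pandas as pd

# Shortened copy of the question's data (first three hours only).
df1 = pd.DataFrame({"Datetime": [7, 8, 9],
                    "BSL": [127.504505, 115.254132, 108.994275]})
df2 = pd.DataFrame({"Datetime": [7, 8, 9],
                    "Number of Accident": [3455, 17388, 27767]})

# Reproduce the failure: df2 lands in the `method` parameter.
try:
    df1.corr(df2, "pearson")
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Put both variables in one frame, keyed by hour, then correlate its columns.
merged = df1.merge(df2, on="Datetime")
corr_matrix = merged[["BSL", "Number of Accident"]].corr(method="pearson")
print(corr_matrix)
```

The off-diagonal entry of `corr_matrix` is the coefficient between the two variables; the diagonal is each column correlated with itself.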

answered by Matt L.

Thanks, but it did not give an hour-by-hour correlation – Gokhan Kazar Feb 12 '20 at 13:44
I solved the part that can be solved based on the data you provided. – Matt L. Feb 12 '20 at 14:02
