Calculate Pearson correlation in Python

Question

I have 4 columns: "Country, year, GDP, CO2 emissions".

I want to measure the Pearson correlation between GDP and CO2 emissions for each country.

The country column contains every country in the world, and the year column has the values "1990, 1991, ..., 2018".

Does this answer your question? [Calculating Pearson correlation and significance in Python](https://stackoverflow.com/questions/3949226/calculating-pearson-correlation-and-significance-in-python) — Bram Vanroy, Feb 07 '20 at 15:06

Celius Stingher · Accepted Answer · 2020-02-07T15:25:05.043

You can combine groupby with corr() as the aggregation function:

import pandas as pd

country = ['India','India','India','India','India','China','China','China','China','China']
Year = [2018,2017,2016,2015,2014,2018,2017,2016,2015,2014]
GDP = [100,98,94,64,66,200,189,165,134,130]
CO2 = [94,96,90,76,64,180,172,150,121,117]
df = pd.DataFrame({'country':country,'Year':Year,'GDP':GDP,'CO2':CO2})
# Prints one 2x2 correlation matrix (GDP vs CO2) per country
print(df.groupby('country')[['GDP','CO2']].corr())

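With a little reshaping, this output can be reduced to a single correlation value per country:

# Select the columns with a list (double brackets); keep only the GDP-row / CO2-column entry of each matrix
df_corr = (df.groupby('country')[['GDP','CO2']].corr()).drop(columns='GDP').drop('CO2',level=1).rename(columns={'CO2':'Correlation'})
# Drop the leftover inner index level so that only the country remains as the index
df_corr = df_corr.reset_index().drop(columns='level_1').set_index('country',drop=True)
print(df_corr)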

Output:

         Correlation
country             
China       0.999581
India       0.932202

Thank you so much. I will apply the same principle, but using the Pearson correlation to get the p-value along with the correlation coefficient — Mustafa Adel, Feb 07 '20 at 15:49
Yes, you can create an extra column with the `P-value` using `pearsonr` from `scipy.stats` — Celius Stingher, Feb 07 '20 at 15:56
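
One possible way to do what the comment above describes, sketched on the toy data from the accepted answer: call `pearsonr` from `scipy.stats` once per country and collect the coefficient and the p-value side by side. The names `rows` and `df_stats` and the column label `P-value` are assumptions chosen for illustration.

```python
import pandas as pd
from scipy.stats import pearsonr

# Toy data copied from the accepted answer
country = ['India'] * 5 + ['China'] * 5
GDP = [100, 98, 94, 64, 66, 200, 189, 165, 134, 130]
CO2 = [94, 96, 90, 76, 64, 180, 172, 150, 121, 117]
df = pd.DataFrame({'country': country, 'GDP': GDP, 'CO2': CO2})

rows = {}
for name, group in df.groupby('country'):
    # pearsonr returns the coefficient first and the two-sided p-value second
    r, p = pearsonr(group['GDP'], group['CO2'])
    rows[name] = {'Correlation': r, 'P-value': p}

# One row per country, one column per statistic
df_stats = pd.DataFrame.from_dict(rows, orient='index')
df_stats.index.name = 'country'
print(df_stats)
```

With only five observations per country the p-values are rough; on the real 1990 to 2018 data each country would contribute up to 29 points.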

score 1 · Answer 2 · answered Feb 07 '20 at 15:18

My guess is that you want the Pearson coefficient for each country. Using pearsonr, you can loop over the countries and build a dictionary of results for each one.

import pandas as pd
from scipy.stats import pearsonr

df = pd.DataFrame({"column1":["value 1", "value 1","value 1","value 1","value 2", "value 2", "value 2", "value 2"], 
              "column2":[1,2,3,4,5, 1,2,3],
             "column3":[10,30,50, 60, 80, 10, 90, 20],
             "column4":[1, 3, 5, 6, 8, 5, 2, 3]})


results = {}
for country in df["column1"].unique():
    # Select this country's rows once, then correlate the two value columns
    group = df.loc[df["column1"] == country]
    # pearsonr returns the coefficient first and the p-value second
    pearsonr_value = pearsonr(group["column3"], group["column4"])
    results[country] = {"pearson": pearsonr_value[0], "pvalue": pearsonr_value[1]}

print(results["value 1"])
# pearson is 1.0 (column4 is exactly column3 / 10), so the p-value is essentially 0

print(results["value 2"])
# pearson is 0.09258200997725514; with only 4 points the p-value is large (roughly 0.91)

@MustafaAdel If it answers your question, could you please accept and upvote the answer? Thank you. — sdhaus, Feb 07 '20 at 16:07
I am sorry that I did not vote. Thank you so much for your help. I will try it and let you know how it went. I am new to Stack Overflow. — Mustafa Adel, Feb 08 '20 at 07:25

score 0 · Answer 3 · Mustafa Adel · answered Feb 09 '20 at 07:19

Thank you @Celius, it worked and gave me the results I wanted.
