Correlation between two Pandas dataframe columns: why does it not work?

Question

I ran into a problem calculating the correlation between two columns. For this assignment we are supposed to use the Pandas .corr method.

I searched around but could not find a suitable solution.

Below is the code.

Top15 is a Pandas DataFrame.

    Top15 = answer_one()

    # for testing purposes: works fine :-(
    df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})
    print(df['A'].corr(df['B']))

    Top15['Population'] = Top15['Energy Supply'] / Top15['Energy Supply per capita']

    Top15['Citable docs per Capita'] = Top15['Citable documents'] / Top15['Population']

    # check my data
    print(Top15['Energy Supply per capita'])
    print(Top15['Citable docs per Capita'])

    correlation = Top15['Citable docs per Capita'].corr(Top15['Energy Supply per capita'])
    print(correlation)
    return correlation

After all this should work. But no, it does not :-(

This is the output I get (the 1.0 is from the test with df['A'] etc.):

1.0
Country
China                  93
United States         286
Japan                 149
United Kingdom        124
Russian Federation    214
Canada                296
Germany               165
India                  26
France                166
South Korea           221
Italy                 109
Spain                 106
Iran                  119
Australia             231
Brazil                 59
Name: Energy Supply per capita, dtype: object
Country
China                   9.269e-05
United States         0.000298307
Japan                 0.000237714
United Kingdom        0.000318721
Russian Federation    0.000127533
Canada                0.000500002
Germany                0.00020942
India                 1.16242e-05
France                 0.00020322
South Korea           0.000239392
Italy                 0.000180175
Spain                  0.00020089
Iran                   0.00011442
Australia             0.000374206
Brazil                4.17453e-05
Name: Citable docs per Capita, dtype: object
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-124-942c0cf8a688> in <module>()
     22     return correlation
     23 
---> 24 answer_nine()

<ipython-input-124-942c0cf8a688> in answer_nine()
     15     Top15['Citable docs per Capita']=np.float64(Top15['Citable docs per Capita'])
     16 
---> 17     correlation=Top15['Citable docs per Capita'].corr(Top15['Energy Supply per capita'])
     18 
     19 

/opt/conda/lib/python3.5/site-packages/pandas/core/series.py in corr(self, other, method, min_periods)
   1392             return np.nan
   1393         return nanops.nancorr(this.values, other.values, method=method,
-> 1394                               min_periods=min_periods)
   1395 
   1396     def cov(self, other, min_periods=None):

/opt/conda/lib/python3.5/site-packages/pandas/core/nanops.py in _f(*args, **kwargs)
     42                                     f.__name__.replace('nan', '')))
     43             try:
---> 44                 return f(*args, **kwargs)
     45             except ValueError as e:
     46                 # we want to transform an object array

/opt/conda/lib/python3.5/site-packages/pandas/core/nanops.py in nancorr(a, b, method, min_periods)
    676 
    677     f = get_corr_func(method)
--> 678     return f(a, b)
    679 
    680 

/opt/conda/lib/python3.5/site-packages/pandas/core/nanops.py in _pearson(a, b)
    684 
    685     def _pearson(a, b):
--> 686         return np.corrcoef(a, b)[0, 1]
    687 
    688     def _kendall(a, b):

/opt/conda/lib/python3.5/site-packages/numpy/lib/function_base.py in corrcoef(x, y, rowvar, bias, ddof)
   2149         # nan if incorrect value (nan, inf, 0), 1 otherwise
   2150         return c / c
-> 2151     return c / sqrt(multiply.outer(d, d))
   2152 
   2153 

AttributeError: 'float' object has no attribute 'sqrt'

I am sorry, but by now I have no clue what goes wrong or why it doesn't work.

Could anyone point me to the solution?

Thanks.

Edit: the basic dataframe looks like this (first three rows + header):

Rank    Documents   Citable documents   Citations   Self-citations  Citations per document  H index 2006    2007    2008    2009    2010    2011    2012    2013    2014    2015    Energy Supply   Energy Supply per capita    % Renewable
Country                                                                             
China   1   127050  126767  597237  411683  4.70    138 3.992331e+12    4.559041e+12    4.997775e+12    5.459247e+12    6.039659e+12    6.612490e+12    7.124978e+12    7.672448e+12    8.230121e+12    8.797999e+12    1.271910e+11    93  19.754910
United States   2   96661   94747   792274  265436  8.20    230 1.479230e+13    1.505540e+13    1.501149e+13    1.459484e+13    1.496437e+13    1.520402e+13    1.554216e+13    1.577367e+13    1.615662e+13    1.654857e+13    9.083800e+10    286 11.570980
Japan   3   30504   30287   223024  61554   7.31    134 5.496542e+12    5.617036e+12    5.558527e+12    5.251308e+12    5.498718e+12    5.473738e+12    5.569102e+12    5.644659e+12    5.642884e+12    5.669563e+12    1.898400e+10    149 10.232820

Can you upload the dataframe `Top15` as a csv file, so we can see what the content of the dataframe is? Or post a truncated version of the dataframe that also reproduces the error? — Fabian Ying, Nov 02 '17 at 11:47
Looks like a similar problem was resolved in this post: https://stackoverflow.com/questions/40453337/python-cannot-make-corr-work — Ivan Burlutskiy, Nov 02 '17 at 11:59
Check also [this post](https://stackoverflow.com/q/42579908/1534017); seems you are taking the same course :) — Cleb, Nov 02 '17 at 13:15

score 1 · Answer 1 · edited Nov 02 '17 at 13:10

This did it:

    correlation = Top15['Citable docs per Capita']\
             .astype('float64').corr(Top15['Energy Supply per capita']\
             .astype('float64'))

Thanks @Shpionus for pointing out the other post.
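
The printed columns in the question both show `dtype: object`, which suggests the values are stored as generic Python objects rather than as a numeric dtype. With the pandas version in the traceback, `np.corrcoef` then ends up calling `sqrt` on plain Python floats, which matches the `AttributeError`. Below is a minimal sketch of the cast, assuming a small hypothetical stand-in for `Top15` (the frame `df` and its four rows are illustrative only, with values copied from the printed output); `pd.to_numeric` is shown as an alternative to `.astype('float64')`.

```python
import pandas as pd

# Hypothetical stand-in for Top15: numeric values held in object-dtype columns.
df = pd.DataFrame(
    {
        "Energy Supply per capita": [93, 286, 149, 124],
        "Citable docs per Capita": [9.269e-05, 0.000298307, 0.000237714, 0.000318721],
    },
    index=["China", "United States", "Japan", "United Kingdom"],
    dtype=object,
)

# Both columns report dtype object, as in the question's output.
print(df.dtypes)

# Convert each column to a numeric dtype before calling .corr.
energy = pd.to_numeric(df["Energy Supply per capita"])
docs = df["Citable docs per Capita"].astype("float64")

correlation = docs.corr(energy)
print(correlation)
```

Checking `df.dtypes` first is a quick way to spot this kind of problem; any column that should be numeric but reports `object` is a candidate for conversion.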

edited Nov 02 '17 at 13:10 by cs95
answered Nov 02 '17 at 12:03 by Andreas K.
