1

I want to solve the correlation coefficient while one row is removed from the dataframe. Then after getting all the correlation coefficients, I need to remove the row that causes the highest increase in the correlation coefficient.

The code below shows my solution:

import pandas as pd
import numpy as np

#Access the data

file='tc_yolanda2.csv'
df = pd.read_csv(file)

x = df['dist']
y = df['mps']

#compute the correlation coefficient

def correlation_coefficient_4u(a,b):
    correl_mat = np.corrcoef(a,b)
    correlation = correl_mat[0,1]
    return correlation

c = correlation_coefficient_4u(x,y)
print('Correlation coeffcient is:',c)

#Let us try this one

lenght = len(df)
print(lenght)
a = 0
while lenght != 0:
    df.drop([a], inplace=True)
    c = correlation_coefficient_4u(df.dist,df.mps)
    a += 1
    print(round(c,4))

It has successfully generated 50 correlation coefficients but generated also many errors such as

RuntimeWarning: Degrees of freedom <= 0 for slice

RuntimeWarning: divide by zero encountered in double_scalars

RuntimeWarning: invalid value encountered in multiply

RuntimeWarning: Mean of empty slice.

RuntimeWarning: invalid value encountered in true_divide

ValueError: labels [50] not contained in axis

My next problem is how to remove the errors and how to locate the index of the correlation coefficients with the highest negative values so that I could remove that row permanently and repeat the above procedures.

By the way, this is my data.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 2 columns):
dist    50 non-null float64
mps     50 non-null int64
dtypes: float64(1), int64(1)
memory usage: 880.0 bytes
None

And the result:

dist  mps
0   441.6    2
1   385.4    7
2   470.7    1
3   402.2    0
4   361.6    0
5   458.6    3
6   453.9    6
7   425.2    4
8   336.6    8
9   265.4    5
10  207.0    5
11  140.5   28
12  229.9    4
13  175.2    6
14  244.5    2
15  455.7    4
16  396.4   12
17  261.8    7
18  291.5    9
19  233.9    2
20  167.8    9
21   88.9   15
22  110.1   25
23   97.1   15
24  160.4   10
25  344.0    0
26  381.6   21
27  391.9    3
28  314.7    2
29  320.7   14
30  252.9   10
31  323.1   12
32  256.0    6
33  281.6    5
34  280.4    5
35  339.8   10
36  301.9   12
37  381.8    0
38  320.2   10
39  347.6    8
40  301.0    4
41  369.7    6
42  378.4    4
43  446.8    4
44  397.4    3
45  454.2    2
46  475.1    0
47  427.0    8
48  463.4    8
49  464.6    2
Correlation coeffcient is: -0.529328951782
49
-0.5209
-0.5227
-0.5091
-0.4998
-0.4975
-0.4879
-0.4903
-0.4838
-0.4845
-0.4908
-0.5085
-0.4541
-0.4736
-0.4962
-0.5273
-0.5189
-0.5452
-0.5494
-0.5485
-0.5882
-0.5999
-0.5711
-0.4321
-0.3251
-0.296
-0.3214
-0.4595
-0.4516
-0.5018
-0.5
-0.4524
-0.431
-0.4514
-0.4955
-0.5603
-0.5263
-0.385
-0.4764
-0.3229
-0.194
-0.3029
-0.1961
-0.2572
-0.2572
-0.6454
-0.7041
-0.5241
-1.0

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3159
    c = cov(x, y, rowvar)
RuntimeWarning: Degrees of freedom <= 0 for slice

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3093
    c *= 1. / np.float64(fact)
RuntimeWarning: divide by zero encountered in double_scalars

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 3093
    c *= 1. / np.float64(fact)
RuntimeWarning: invalid value encountered in multiply
nan

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\lib\function_base.py", line 1110
    avg = a.mean(axis)
RuntimeWarning: Mean of empty slice.

Warning (from warnings module):
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\numpy\core\_methods.py", line 73
    ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide
nan
Traceback (most recent call last):
  File "C:/Users/User/Desktop/CARDS 2017 Research Study/Python/methodology.py", line 28, in <module>
    df.drop([a], inplace=True)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 2530, in drop
    obj = obj._drop_axis(labels, axis, level=level, errors=errors)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 2562, in _drop_axis
    new_axis = axis.drop(labels, errors=errors)
  File "C:\Users\User\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\indexes\base.py", line 3744, in drop
    labels[mask])
ValueError: labels [50] not contained in axis
Ino
  • 25
  • 1
  • 11
  • 2
    Hi. Please take the time to read this post on [how to provide a great pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) as well as how to provide a [minimal, complete, and verifiable example](http://stackoverflow.com/help/mcve) and revise your question accordingly. These tips on [how to ask a good question](http://stackoverflow.com/help/how-to-ask) may also be useful. – jezrael Jun 04 '18 at 06:17

1 Answers1

1

You can use the following code to find and remove the row that causes the highest increase in the correlation coefficient.

length=len(df)
def dropcc(df):
    df_temp=df.copy()
    idxmax=0
    c=0

    for i,v in df_temp.iterrows():
        df_temp.drop([i], inplace=True)
        c_temp = correlation_coefficient_4u(df_temp.dist,df_temp.mps)
        if c > c_temp:
            idxmax=i
            c=c_temp
        df_temp=df.copy()
        #print(round(c_temp,4))

    df.drop([idxmax], inplace=True)
    return df

for i in range(0, length-1):
    cc=correlation_coefficient_4u(df.dist,df.mps)
    if cc < -0.9:
        break
    else:
        df=dropcc(df)
Aurelia_B
  • 163
  • 5
  • you did it excellently. Well, my next problem is how to loop that code. Dropping the row that causes the highest change in the correlation coefficient (cc) increases the cc between dist and mps and therefore, I need to stop it once it reaches -0.90 cc. – Ino Jun 05 '18 at 06:08