85

I have a huge data set, and prior to machine learning modeling it is always suggested that you first remove highly correlated descriptors (columns). How can I calculate the column-wise correlation and remove columns above a threshold value, say, drop all columns or descriptors having a correlation > 0.8? It should also retain the headers in the reduced data.

Example data set

 GA      PN       PC     MBP      GR     AP   
0.033   6.652   6.681   0.194   0.874   3.177    
0.034   9.039   6.224   0.194   1.137   3.4      
0.035   10.936  10.304  1.015   0.911   4.9      
0.022   10.11   9.603   1.374   0.848   4.566    
0.035   2.963   17.156  0.599   0.823   9.406    
0.033   10.872  10.244  1.015   0.574   4.871     
0.035   21.694  22.389  1.015   0.859   9.259     
0.035   10.936  10.304  1.015   0.911   4.5       

Please help....

desertnaut
  • 57,590
  • 26
  • 140
  • 166
jax
  • 3,927
  • 7
  • 41
  • 70
  • [Feature-Engine](https://feature-engine.readthedocs.io/en/1.1.x/selection/DropCorrelatedFeatures.html) has a built in `DropCorrelatedFeatures()` transformer which does the heavy lifting for you & is sklearn compatible. The `features_to_drop_` attribute shows which it will drop. – kevin_theinfinityfund Oct 25 '21 at 15:56
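For reference, a minimal sketch of the Feature-engine approach mentioned in the comment above, assuming a features DataFrame X and the question's 0.8 threshold:

from feature_engine.selection import DropCorrelatedFeatures

# Drop one feature out of every group whose pairwise correlation exceeds 0.8
tr = DropCorrelatedFeatures(method="pearson", threshold=0.8)
X_reduced = tr.fit_transform(X)
print(tr.features_to_drop_)  # the columns the transformer removed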

28 Answers

78

The method here worked well for me, only a few lines of code: https://chrisalbon.com/machine_learning/feature_selection/drop_highly_correlated_features/

import numpy as np

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

# Drop features 
df.drop(to_drop, axis=1, inplace=True)
bryant1410
  • 5,540
  • 4
  • 39
  • 40
Cherry Wu
  • 3,844
  • 9
  • 43
  • 63
  • 11
    isn't this flawed? Always first column is dropped even though it might not be highly correlated with any other column. when upper triangle is selected none of the first col value remains – Sushant Kulkarni Nov 07 '19 at 03:58
  • 1
    have you ever output corr_matrix and see what does it look like first? – Cherry Wu Nov 07 '19 at 04:22
  • 3
    I got an error while dropping the selected features, the following code worked for me `df.drop(to_drop,axis=1,inplace=True)` – Ikbel Nov 07 '19 at 15:50
  • 1
    @ikbelbenabdessamad yeah, your code is better. I just updated that old version code, thank you! – Cherry Wu Nov 07 '19 at 19:16
  • 3
    As of the date of writing this comment, this seems to be working fine. I cross-checked for varying thresholds using other methods provided in answers, and results were identical. Thanks! – Sunit Gautam Nov 07 '20 at 21:22
  • 1
    This will drop all columns with corr > 0.95, we want to drop all except one. – Rishabh Agrahari Apr 30 '21 at 09:13
  • 1
    It should be `corr_matrix.where((np.triu(np.ones(corr_matrix.shape), k=1) + np.tril(np.ones(corr_matrix.shape), k=-1)).astype(bool))`. Your code does not consider the first column at all. – Mehran Dec 27 '21 at 03:06
49

Here is the approach which I have used -

def correlation(dataset, threshold):
    col_corr = set() # Set of all the names of deleted columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i] # getting the name of column
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname] # deleting the column from the dataset

    print(dataset)

Hope this helps!

NISHA DAGA
  • 575
  • 7
  • 14
  • 11
    I feel like this solution fails in the following general case: Say you have columns c1, c2, and c3. c1 and c2 are correlated above the threshold, the same goes for c2 and c3. With this solution both c2 and c3 will be dropped even though c3 may not be correlated with c1 above that threshold. I suggest changing: `if corr_matrix.iloc[i, j] >= threshold:` To: `if corr_matrix.iloc[i, j] >= threshold and (corr_matrix.columns[j] not in col_corr):` – vcovo Feb 22 '19 at 16:08
  • @vcovo If c1 & c2 are correlated and c2 & c3 are correlated, then there is a high chance that c1 & c3 will also be correlated. Although, if that is not true, then I believe that your suggestion of changing the code is correct. – NISHA DAGA Feb 23 '19 at 17:43
  • 1
    They most likely would be correlated but not necessarily above the same `threshold`. This lead to a significant difference in removed columns for my use case. I ended up with 218 columns instead of 180 when adding the additional condition mentioned in the first comment. – vcovo Feb 24 '19 at 19:41
  • 3
    Makes sense. Have updated the code as per your suggestion. – NISHA DAGA Feb 26 '19 at 14:36
  • @vcovo if c1 and c2 only correlated, how do we choose the best column to remove? – Smart Manoj Aug 21 '20 at 17:05
  • @SmartManoj in my use case I just wanted to minimize the number of columns and thus removed highly correlated ones. I had no preference for which one to keep and thus removed the second one (as in the rightmost column). I suppose you could create a metric that takes in to account the correlation between each column and all others and then when presented with a highly correlated pair remove the one that is most correlated with all other columns (in order to preserve a little more of the variance). – vcovo Sep 01 '20 at 17:37
  • 2
    Shouldn't you use the absolute value of the correlation matrix? – hipoglucido Oct 03 '20 at 15:40
  • Indeed, absolute value makes much more sense as -0.9 is just as strong as 0.9 – Anonymous May 03 '21 at 09:42
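Following the two comments above, a sketch of the same approach using the absolute correlation (an editor's variant of this answer, not the original code):

def correlation_abs(dataset, threshold):
    col_corr = set()  # names of deleted columns
    corr_matrix = dataset.corr().abs()  # abs(), so -0.9 is treated like 0.9
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if (corr_matrix.iloc[i, j] >= threshold) and (corr_matrix.columns[j] not in col_corr):
                colname = corr_matrix.columns[i]
                col_corr.add(colname)
                if colname in dataset.columns:
                    del dataset[colname]  # delete the column from the dataset
    return dataset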
12

Here is an Auto ML class I created to eliminate multicollinearity between features.

What makes my code unique is that out of two features that have high correlation, I eliminate the feature that is least correlated with the target! I got the idea from this seminar by Vishal Patel: https://www.youtube.com/watch?v=ioXKxulmwVQ&feature=youtu.be

import pandas as pd

#Feature selection class to eliminate multicollinearity
class MultiCollinearityEliminator():
    
    #Class Constructor
    def __init__(self, df, target, threshold):
        self.df = df
        self.target = target
        self.threshold = threshold

    #Method to create and return the feature correlation matrix dataframe
    def createCorrMatrix(self, include_target = False):
        #Checking we should include the target in the correlation matrix
        if (include_target == False):
            df_temp = self.df.drop([self.target], axis =1)
            
            #Setting method to Pearson to prevent issues in case the default method for df.corr() gets changed
            #Setting min_period to 30 for the sample size to be statistically significant (normal) according to 
            #central limit theorem
            corrMatrix = df_temp.corr(method='pearson', min_periods=30).abs()
        #Target is included for creating the series of feature to target correlation - Please refer the notes under the 
        #print statement to understand why we create the series of feature to target correlation
        elif (include_target == True):
            corrMatrix = self.df.corr(method='pearson', min_periods=30).abs()
        return corrMatrix

    #Method to create and return the feature to target correlation matrix dataframe
    def createCorrMatrixWithTarget(self):
        #After obtaining the list of correlated features, this method will help to view which variables 
        #(in the list of correlated features) are least correlated with the target
        #This way, out of the list of correlated features, we can ensure to eliminate the feature that is 
        #least correlated with the target
        #This not only helps to sustain the predictive power of the model but also helps in reducing model complexity
        
        #Obtaining the correlation matrix of the dataframe (along with the target)
        corrMatrix = self.createCorrMatrix(include_target = True)                           
        #Creating the required dataframe, then dropping the target row 
        #and sorting by the value of correlation with target (in ascending order)
        corrWithTarget = pd.DataFrame(corrMatrix.loc[:,self.target]).drop([self.target], axis = 0).sort_values(by = self.target)                    
        print(corrWithTarget, '\n')
        return corrWithTarget

    #Method to create and return the list of correlated features
    def createCorrelatedFeaturesList(self):
        #Obtaining the correlation matrix of the dataframe (without the target)
        corrMatrix = self.createCorrMatrix(include_target = False)                          
        colCorr = []
        #Iterating through the columns of the correlation matrix dataframe
        for column in corrMatrix.columns:
            #Iterating through the values (row wise) of the correlation matrix dataframe
            for idx, row in corrMatrix.iterrows():                                            
                if(row[column]>self.threshold) and (row[column]<1):
                    #Adding the features that are not already in the list of correlated features
                    if (idx not in colCorr):
                        colCorr.append(idx)
                    if (column not in colCorr):
                        colCorr.append(column)
        print(colCorr, '\n')
        return colCorr

    #Method to eliminate the least important features from the list of correlated features
    def deleteFeatures(self, colCorr):
        #Obtaining the feature to target correlation matrix dataframe
        corrWithTarget = self.createCorrMatrixWithTarget()                                  
        for idx, row in corrWithTarget.iterrows():
            print(idx, '\n')
            if (idx in colCorr):
                self.df = self.df.drop(idx, axis =1)
                break
        return self.df

    #Method to run automatically eliminate multicollinearity
    def autoEliminateMulticollinearity(self):
        #Obtaining the list of correlated features
        colCorr = self.createCorrelatedFeaturesList()                                       
        while colCorr != []:
            #Obtaining the dataframe after deleting the feature (from the list of correlated features) 
            #that is least correlated with the target
            self.df = self.deleteFeatures(colCorr)
            #Obtaining the list of correlated features
            colCorr = self.createCorrelatedFeaturesList()                                     
        return self.df
Joseph Jacob
  • 163
  • 1
  • 7
  • Can you provide an example of how to use? – mjoy Jun 08 '22 at 13:24
  • @mjoy Here is an example: `my_eliminator = MultiCollinearityEliminator(df, 'my_target', 0.95)` then you can call the following function: `cleaned_df_no_multi_collinearity = my_eliminator.autoEliminateMulticollinearity()`. NB: The dataframe `df` must contain the target variable column `'my_target'` – JejeBelfort Jan 25 '23 at 07:41
10

I found the answer provided by TomDobbs quite useful; however, it doesn't work as intended. It has two problems:

  • it misses the last pair of variables in each of correlation matrix rows/columns.
  • it fails to remove one of each pair of collinear variables from the returned dataframe.

My revised version below corrects these issues:

def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold. Removing collinear features can help a model 
        to generalize and improves the interpretability of the model.

    Inputs: 
        x: features dataframe
        threshold: features with correlations greater than this value are removed

    Output: 
        dataframe that contains only the non-highly-collinear features
    '''

    # Calculate the correlation matrix
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i+1):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)

            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])

    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    x = x.drop(columns=drops)

    return x
Synergix
  • 145
  • 1
  • 10
9

You can test the code below:

# Load libraries
import pandas as pd
import numpy as np

# Create feature matrix with two highly correlated features

X = np.array([[1, 1, 1],
              [2, 2, 0],
              [3, 3, 1],
              [4, 4, 0],
              [5, 5, 1],
              [6, 6, 0],
              [7, 7, 1],
              [8, 7, 0],
              [9, 7, 1]])

# Convert feature matrix into DataFrame
df = pd.DataFrame(X)

# View the data frame
df

# Create correlation matrix
corr_matrix = df.corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.95
to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]
# Drop features 
df.drop(to_drop, axis=1)
abakar
  • 301
  • 3
  • 6
  • 3
    While this code may provide a solution to the question, it's better to add context as to why/how it works. This can help future users learn, and apply that knowledge to their own code. You are also likely to have positive feedback from users in the form of upvotes, when the code is explained. – borchvm Feb 14 '20 at 10:21
8

You can use the following for a given data frame df:

import numpy as np

corr_matrix = df.corr().abs()
high_corr_var = np.where(corr_matrix > 0.8)
high_corr_var = [(corr_matrix.columns[x], corr_matrix.columns[y]) for x, y in zip(*high_corr_var) if x != y and x < y]
Mojgan Mazouchi
  • 355
  • 1
  • 6
  • 15
  • 1
    This did not work for me. Please consider rewriting your solution as a method. Error: "ValueError: too many values to unpack (expected 2)". – MyopicVisage Aug 04 '17 at 19:54
  • 1
    It should rather be `high_corr_var=[(corr_matrix.index[x],corr_matrix.columns[y]) for x,y in zip(*high_corr_var) if x!=y and x<y]` – Jeru Luke Sep 26 '17 at 16:46
6

Firstly, I'd suggest using something like PCA as a dimensionality reduction method, but if you have to roll your own then your question is insufficiently constrained. Where two columns are correlated, which one do you want to remove? What if column A is correlated with column B, while column B is correlated with column C, but not column A?

You can get a pairwise matrix of correlations by calling DataFrame.corr() (docs) which might help you with developing your algorithm, but eventually you need to convert that into a list of columns to keep.
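As an illustration of that two-step idea, a sketch assuming the question's data is loaded in a DataFrame df and 0.8 is the threshold: compute the pairwise matrix, then greedily build the list of columns to keep:

corr = df.corr().abs()  # pairwise correlation matrix

keep = []
for col in corr.columns:
    # keep a column only if it is not highly correlated with any column already kept
    if all(corr.loc[col, kept] <= 0.8 for kept in keep):
        keep.append(col)

df_reduced = df[keep]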

Jamie Bull
  • 12,889
  • 15
  • 77
  • 116
  • While I totally agree with your reasoning, this does not really answer the question. `PCA` is a more advanced concept for dimension reduction. But note that using correlations does work and the question is a reasonable (but definitely lacking research effort IMO). – cel Mar 27 '15 at 08:40
  • @Jamie bull Thanks for your kind reply. Before going to advanced techniques like dimensionality reduction (e.g. PCA) or feature selection methods (e.g. tree-based or SVM-based feature elimination), it is always suggested to remove useless features with the help of basic techniques (like variance or correlation calculation), as I learned from various published works. And as per the second part of your comment, "correlations by calling DataFrame.corr()" would be helpful for my case. – jax Mar 27 '15 at 09:09
  • 2
    @jax, `it is always suggested to remove useless feature with the help of basic techniques`. This is not true. There are various methods which do not require such a preprocessing step. – cel Mar 27 '15 at 09:20
  • @cel OK, actually I was following some published work that suggested these preprocessing steps. Can you please suggest any such method that does not require this preprocessing step? Thanks. – jax Mar 27 '15 at 09:46
  • There's a discussion of when you should remove correlated variables before PCA [here](http://stats.stackexchange.com/questions/50537/should-one-remove-highly-correlated-variables-before-doing-pca). It comes down to whether they are correlated because they are both influenced by each other or a third underlying feature, in which case there is an argument for removing one them. Or alternatively where they are correlated but not because they are truly related, in which case there is an argument for keeping both. This depends on understanding the variables and so isn't easily done algorithmically. – Jamie Bull Mar 27 '15 at 12:30
  • 1
    @JamieBull Thanks for your reply. I had already been there (the web link you suggested) before posting this. But if you go through the question carefully, this covers only half of it; I have already read a lot and hopefully soon I will post an answer myself. Thanks a lot for all your support and interest. – jax Mar 27 '15 at 15:31
5

I took the liberty to modify TomDobbs' answer. The bug reported in the comments is fixed now. Also, the new function filters out negative correlation, too.

def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity, and aids in generalizing the model.
    Inputs:
          df: features df (x)
          corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        x = x.drop(col, axis=1)
    return x
azuber
  • 389
  • 4
  • 12
3

Plug your features dataframe in this function and just set your correlation threshold. It'll auto drop columns, but will also give you a diagnostic of the columns it drops if you want to do it manually.

def corr_df(x, corr_val):
    '''
    Obj: Drops features that are strongly correlated to other features.
          This lowers model complexity, and aids in generalizing the model.
    Inputs:
          df: features df (x)
          corr_val: Columns are dropped relative to the corr_val input (e.g. 0.8)
    Output: df that only includes uncorrelated features
    '''

    # Creates Correlation Matrix and Instantiates
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []

    # Iterates through Correlation Matrix Table to find correlated columns
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = item.values
            if val >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(i)

    drops = sorted(set(drop_cols))[::-1]

    # Drops the correlated columns
    for i in drops:
        col = x.iloc[:, (i+1):(i+2)].columns.values
        df = x.drop(col, axis=1)

    return df
TomDobbs
  • 871
  • 5
  • 7
  • 8
    This doesn't seem to work for me. The correlations are found and the pairs that match the threshold (i.e. have a higher correlation) are printed. But the resulting dataframe is only missing one (the first) variable, that has a high correlation. – n1k31t4 Jun 13 '17 at 21:30
3

At first, thanks to TomDobbs and Synergix for their code. Below I am sharing my modified version with some additions:

  1. Between two correlated variables this function drops a variable which has the least correlation with the target variable
  2. Added some useful logs (set verbose to True for log printing)
def remove_collinear_features(df_model, target_var, threshold, verbose):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold and which have the least correlation with the target (dependent) variable. Removing collinear features can help a model 
        to generalize and improves the interpretability of the model.

    Inputs: 
        df_model: features dataframe
        target_var: target (dependent) variable
        threshold: features with correlations greater than this value are removed
        verbose: set to "True" for the log printing

    Output: 
        dataframe that contains only the non-highly-collinear features
    '''

    # Calculate the correlation matrix
    corr_matrix = df_model.drop(target_var, axis=1).corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []
    dropped_feature = ""

    # Iterate through the correlation matrix and compare correlations
    for i in iters:
        for j in range(i+1): 
            item = corr_matrix.iloc[j:(j+1), (i+1):(i+2)]
            col = item.columns
            row = item.index
            val = abs(item.values)

            # If correlation exceeds the threshold
            if val >= threshold:
                # Print the correlated features and the correlation value
                if verbose:
                    print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                col_value_corr = df_model[col.values[0]].corr(df_model[target_var])
                row_value_corr = df_model[row.values[0]].corr(df_model[target_var])
                if verbose:
                    print("{}: {}".format(col.values[0], np.round(col_value_corr, 3)))
                    print("{}: {}".format(row.values[0], np.round(row_value_corr, 3)))
                if col_value_corr < row_value_corr:
                    drop_cols.append(col.values[0])
                    dropped_feature = "dropped: " + col.values[0]
                else:
                    drop_cols.append(row.values[0])
                    dropped_feature = "dropped: " + row.values[0]
                if verbose:
                    print(dropped_feature)
                    print("-----------------------------------------------------------------------------")

    # Drop one of each pair of correlated columns
    drops = set(drop_cols)
    df_model = df_model.drop(columns=drops)

    print("dropped columns: ")
    print(list(drops))
    print("-----------------------------------------------------------------------------")
    print("used columns: ")
    print(df_model.columns.tolist())

    return df_model
Emkan
  • 188
  • 1
  • 7
  • 1
    [Is it safe to replace '==' with 'is' to compare Boolean-values](https://stackoverflow.com/a/4591139) – Smart Manoj Sep 16 '20 at 01:51
  • If we add the abs() function while calculating the correlation value between target and feature, we will not see negative correlation values. This is important because, when there is negative correlation, the code drops the smaller one, which has the stronger negative correlation value. /// col_corr = abs(df_model[col.values[0]].corr(df_model[target_var])) – Yiğit Can Taşoğlu Sep 18 '21 at 12:38
3

I know that there are already a lot of answers to this, but one way I found very simple and short is the following:


# Get correlation matrix 
corr = X.corr()

# Create a mask for values above 90%
# but below 100%, since every variable is perfectly correlated with itself
mask = (X.corr() > 0.9) & (X.corr() < 1.0)
high_corr = corr[mask]

# Create a new column mask using any() and ~
col_to_filter_out = ~high_corr[mask].any()

# Apply new mask
X_clean = X[high_corr.columns[col_to_filter_out]]

# Visualize cleaned dataset
X_clean
Antoine Krajnc
  • 1,163
  • 10
  • 29
2

If you run out of memory due to pandas .corr() you may find the following solution useful:

import numpy as np
from numba import jit

@jit(nopython=True)
def corr_filter(X, threshold):
    n = X.shape[1]
    columns = np.ones((n,))
    for i in range(n-1):
        for j in range(i+1, n):
            if columns[j] == 1:
                correlation = np.abs(np.corrcoef(X[:,i], X[:,j])[0,1])
                if correlation >= threshold:
                    columns[j] = 0
    return columns

columns = corr_filter(df.values, 0.7).astype(bool)
selected_columns = df.columns[columns]
tdogan
  • 21
  • 2
  • Hi! Welcome to SO. Thank you for the contribution! Here is a guide on how to share your knowledge: https://stackoverflow.blog/2011/07/01/its-ok-to-ask-and-answer-your-own-questions/ – Bedir Yilmaz Jul 20 '20 at 18:16
1

A small revision to the solution posted by user3025698 that resolves an issue where the correlation between the first two columns is not captured, plus some data type checking.

import numpy as np
import pandas as pd

def filter_df_corr(inp_data, corr_val):
    '''
    Returns an array or dataframe (based on type(inp_data)) adjusted to drop \
        columns with high correlation to one another. Takes second arg corr_val
        that defines the cutoff

    ----------
    inp_data : np.array, pd.DataFrame
        Values to consider
    corr_val : float
        Value [0, 1] on which to base the correlation cutoff
    '''
    # Creates Correlation Matrix
    if isinstance(inp_data, np.ndarray):
        inp_data = pd.DataFrame(data=inp_data)
        array_flag = True
    else:
        array_flag = False
    corr_matrix = inp_data.corr()

    # Iterates through Correlation Matrix Table to find correlated columns
    drop_cols = []
    n_cols = len(corr_matrix.columns)

    for i in range(n_cols):
        for k in range(i+1, n_cols):
            val = corr_matrix.iloc[k, i]
            col = corr_matrix.columns[i]
            row = corr_matrix.index[k]
            if abs(val) >= corr_val:
                # Prints the correlated feature set and the corr val
                print(col, "|", row, "|", round(val, 2))
                drop_cols.append(col)

    # Drops the correlated columns
    drop_cols = set(drop_cols)
    inp_data = inp_data.drop(columns=drop_cols)
    # Return same type as inp
    if array_flag:
        return inp_data.values
    else:
        return inp_data
Ryan
  • 655
  • 2
  • 8
  • 12
1

The question here refers to a HUGE dataset. However, all of the answers I see are dealing with dataframes. I present an answer for a scipy sparse matrix which runs in parallel. Rather than returning a giant correlation matrix, this returns a feature mask of fields to keep after checking all fields for both positive and negative Pearson correlations.

I also try to minimize calculations using the following strategy:

  • Process each column
  • Start at the current column + 1 and calculate correlations moving to the right.
  • For any abs(correlation) >= threshold, mark the current column for removal and calculate no further correlations.
  • Perform these steps for each column in the dataset except the last.

This might be sped up further by keeping a global list of columns marked for removal and skipping further correlation calculations for such columns, since columns will execute out of order. However, I do not know enough about race conditions in python to implement this tonight.

Returning a column mask will obviously allow the code to handle much larger datasets than returning the entire correlation matrix.

Check each column using this function:

from scipy.stats import pearsonr

def get_corr_row(idx_num, sp_mat, thresh):
    # slice the column at idx_num
    cols = sp_mat.shape[1]
    x = sp_mat[:,idx_num].toarray().ravel()
    start = idx_num + 1

    # Now slice each column to the right of idx_num
    for i in range(start, cols):
        y = sp_mat[:,i].toarray().ravel()
        # Check the Pearson correlation
        corr, pVal = pearsonr(x, y)
        # Pearson ranges from -1 to 1.
        # We check both positive and negative correlations >= thresh using abs(corr)
        if abs(corr) >= thresh:
            # Mark the column at idx_num for removal in the mask and
            # stop checking after finding the 1st correlation >= thresh
            return False
    return True
    

Run the column level correlation checks in parallel:

from joblib import Parallel, delayed  
import multiprocessing


def Get_Corr_Mask(sp_mat, thresh, n_jobs=-1):
    
    # we must make sure the matrix is in csc format 
    # before we start doing all these column slices!  
    sp_mat = sp_mat.tocsc()
    cols = sp_mat.shape[1]
    
    if n_jobs == -1:
        # Process the work on all available CPU cores
        num_cores = multiprocessing.cpu_count()
    else:
        # Process the work on the specified number of CPU cores
        num_cores = n_jobs

    # Return a mask of all columns to keep by calling get_corr_row() 
    # once for each column in the matrix     
    return Parallel(n_jobs=num_cores, verbose=5)(delayed(get_corr_row)(i, sp_mat, thresh) for i in range(cols))

General Usage:

#Get the mask using your sparse matrix and threshold.
corr_mask = Get_Corr_Mask(X_t_fpr, 0.95) 

# Remove features that are >= 95% correlated
X_t_fpr_corr = X_t_fpr[:,corr_mask]
Jake Drew
  • 2,230
  • 23
  • 29
1

If you want to return a breakdown of correlated columns, you can use this function to look at what you are dropping and adjust your threshold:

import numpy as np
import pandas as pd

def corr_cols(df, thresh):
    # Create correlation matrix
    corr_matrix = df.corr().abs()
    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool_))

    dic = {'Feature_1': [], 'Feature_2': [], 'val': []}
    for col in upper.columns:
        corl = list(filter(lambda x: x >= thresh, upper[col] ))
        #print(corl)
        if len(corl) > 0:
            inds = [round(x,4) for x in corl]
            for ind in inds:
                #print(col)
                #print(ind)
                col2 = upper[col].index[list(upper[col].apply(lambda x: round(x,4))).index(ind)]
                #print(col2)
                dic['Feature_1'].append(col)
                dic['Feature_2'].append(col2)
                dic['val'].append(ind) 
    return pd.DataFrame(dic).sort_values(by="val", ascending=False)

And then remove them by dropping the columns from the df:

corr = corr_cols(star, 0.5)
df.drop(columns=corr.iloc[:, 0].unique())
1

There are three challenges to this problem. First, if features x and y are correlated, you don't want to use an algorithm that drops both. Second, if x and y are pairwise correlated and features y and z are also pairwise correlated, you want the algorithm to only remove y. In this sense, you want it to remove the minimum number of features so that no remaining features have correlations above your threshold. Third, from an efficiency standpoint, you do not want to have to compute the correlation matrix more than once.

Here's an option:

from collections import Counter

def corr_cleaner(df, corr_cutoff):
    '''
    df: pandas dataframe with column headers.
    corr_cutoff: float between 0 and 1.
    '''
    abs_corr_matrix = df.corr().abs()
    filtered_cols = []
    while True:
        offenders = []
        for i in range(len(abs_corr_matrix)):
            for j in range(len(abs_corr_matrix)):
                if i != j:
                    if abs_corr_matrix.iloc[i,j] > corr_cutoff:
                        # use the matrix's own columns so labels stay aligned
                        offenders.append(abs_corr_matrix.columns[i])

        if len(offenders) > 0: # if at least one high correlation remains
            c = Counter(offenders)
            worst_offender = c.most_common(1)[0][0]  # var name of worst offender
            del df[worst_offender]
            filtered_cols.append(worst_offender)
            abs_corr_matrix.drop(worst_offender, axis=0, inplace=True) #drop from x-axis
            abs_corr_matrix.drop(worst_offender, axis=1, inplace=True) #drop from y-axis
        else: # if no high correlations remain, break
            break

    return df, filtered_cols
DavidSilverberg
  • 121
  • 1
  • 4
0

This is the approach I used on my job last month. Perhaps it is not the best or quickest way, but it works fine. Here, df is my original Pandas dataframe:

dropvars = []
threshold = 0.95
df_corr = df.corr().stack().reset_index().rename(columns={'level_0': 'Var 1', 'level_1': 'Var 2', 0: 'Corr'})
df_corr = df_corr[(df_corr['Corr'].abs() >= threshold) & (df_corr['Var 1'] != df_corr['Var 2'])]
while len(df_corr) > 0:
    var = df_corr['Var 1'].iloc[0]
    df_corr = df_corr[((df_corr['Var 1'] != var) & (df_corr['Var 2'] != var))]
    dropvars.append(var)
df.drop(columns=dropvars, inplace=True)

My idea is as follows: first, I create a dataframe containing columns Var 1, Var 2 and Corr, where I keep only those pairs of variables whose correlation is higher than or equal to my threshold (in absolute value). Then, I iteratively choose the first variable (the Var 1 value) in this correlations dataframe, add it to the dropvars list, and remove all lines of the correlations dataframe where it appears, until my correlations dataframe is empty. In the end, I remove the columns in my dropvars list from my original dataframe.

Celso
  • 649
  • 1
  • 6
  • 15
0

I had a similar question today and came across this post. This is what I ended up with.

def uncorrelated_features(df, threshold=0.7):
    """
    Returns a subset of df columns with Pearson correlations
    below threshold.
    """

    corr = df.corr().abs()
    keep = []
    for i in range(len(corr.iloc[:,0])):
        above = corr.iloc[:i,i]
        if len(keep) > 0: above = above[keep]
        if len(above[above < threshold]) == len(above):
            keep.append(corr.columns.values[i])

    return df[keep]
0

I wrote my own way, without any for loop, to delete highly correlated columns from a pandas dataframe.

# get the correlation of the data
coVar = df.corr()  # or df.corr().abs()
threshold = 0.5
"""
1. .where(coVar != 1.0) sets NaN where column and index are equal (the diagonal)
2. .where(coVar >= threshold) sets NaN where the value is below the threshold
3. .fillna(0) fills the NaN with 0
4. .sum() converts the data frame to a Series, summing only where the correlation exceeds the threshold
5. > 0 converts the Series to Boolean
"""

coVarCols = coVar.where(coVar != 1.0).where(coVar >= threshold).fillna(0).sum() > 0

# Negate, because we want to delete the columns whose correlation exceeds the threshold
coVarCols = ~coVarCols

# get where you want
df[coVarCols[coVarCols].index]

I hope this helps: using pandas' own functions, without any for loop, can improve speed on a big dataset.

0
correlatedColumns = []
corr = df.corr()
indices = corr.index
columns = corr.columns
posthreshold = 0.7
negthreshold = -0.7

for c in columns:
    for r in indices:
        if c != r and (corr[c][r] > posthreshold or corr[c][r] < negthreshold):
            correlatedColumns.append({"column" : c , "row" : r , "val" :corr[c][r] })
            

print(correlatedColumns)
Chandan
  • 1,486
  • 2
  • 15
  • 24
0

In my code I needed to remove columns with low correlation to the dependent variable, and I got this code:

to_drop = pd.DataFrame(to_drop).fillna(True)
to_drop = list(to_drop[to_drop['SalePrice'] <.4 ].index)
df_h1.drop(to_drop,axis=1)

`df_h1` is my dataframe and `SalePrice` is the dependent variable; I think changing the threshold value may suit all other problems.
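For context, a hedged guess at how `to_drop` might have been produced before the snippet above; the variable is not defined in the answer, so this line is purely hypothetical:

# Hypothetical: to_drop could start as the correlations with the target
to_drop = df_h1.corr()['SalePrice']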

suhail
  • 21
  • 4
0

The snippet below drops the most correlated feature iteratively.

import numpy as np

def get_corr_feature(df):
    corr_matrix = df.corr().abs()
    # Select upper triangle of correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool_))
    upper['score'] = upper.max(axis=1)
    # Find the most correlated feature and return it for dropping
    column_name = upper.sort_values(by=['score'], ascending=False).index[0]
    max_score = upper.loc[column_name, 'score']
    return column_name, max_score

max_score=1
while max_score>0.5:
    column_name, max_score=get_corr_feature(df)
    df.drop(column_name,axis=1,inplace=True)
Reza energy
  • 135
  • 7
0

I wrote a notebook that uses partial correlations

https://gist.github.com/thistleknot/ce1fc38ea9fcb1a8dafcfe6e0d8af475

the gist of it (pun intended)

for train_index, test_index in kfold.split(all_data):
    #print(iteration)
    max_pvalue = 1
    
    subset = all_data.iloc[train_index].loc[:, ~all_data.columns.isin([exclude])]
    
    #skip y and states
    set_ = subset.loc[:, ~subset.columns.isin([target])].columns.tolist()
    
    n=len(subset)
    
    while(max_pvalue>=.05):

        dist = scipy.stats.beta(n/2 - 1, n/2 - 1, loc=-1, scale=2)
        p_values = pd.DataFrame(2*dist.cdf(-abs(subset.pcorr()[target]))).T
        p_values.columns = list(subset.columns)
        
        max_pname = p_values.idxmax(axis=1)[0]
        max_pvalue = p_values[max_pname].values[0]
        
        if (max_pvalue > .05):

            set_.remove(max_pname)
            temp = [target]
            temp.extend(set_)
            subset = subset[temp]
    
    winners = p_values.loc[:, ~p_values.columns.isin([target])].columns.tolist()
    sig_table = (sig_table + np.where(all_data.columns.isin(winners),1,0)).copy()
    
    signs_table[all_data.columns.get_indexer(winners)]+=np.where(subset.pcorr()[target][winners]<0,-1,1)


significance = pd.DataFrame(sig_table).T
significance.columns = list(all_data.columns)
display(significance)

sign = pd.DataFrame(signs_table).T
sign.columns = list(all_data.columns)
display(sign)

purity = abs((sign/num_folds)*(sign/significance)).T.replace([np.inf, -np.inf, np.nan], 0)
display(purity.T)
thistleknot
  • 1,098
  • 16
  • 38
0

I believe this has to be done in an iterative way:

import numpy as np

uncorrelated_features = features.copy()

# Loop until there's nothing to drop
while True:
    # Calculating the correlation matrix for the remaining list of features
    cor = uncorrelated_features.corr().abs()

    # Generating a square matrix with all 1s except for the main axis
    zero_main = (np.triu(np.ones(cor.shape), k=1) +
                 np.tril(np.ones(cor.shape), k=-1))

    # Using the zero_main matrix to filter out the main axis of the correlation matrix
    except_main = cor.where(zero_main.astype(bool))

    # Calculating some metrics for each column, including the max correlation,
    # mean correlation and the name of the column
    metrics = [(except_main[column].max(), except_main[column].mean(), column) for column in except_main.columns]

    # Sort the list to find the most suitable candidate to drop at index 0
    metrics.sort(key=lambda x: (x[0], x[1]), reverse=True)

    # Check and see if there's anything to drop from the list of features
    if metrics[0][0] > 0.5:
        uncorrelated_features.drop(metrics[0][2], axis=1, inplace=True)
    else:
        break

It's worth mentioning that you might want to customize the way I sorted the metrics list and/or how I detected whether I want to drop the column or not.
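For example, a sketch of one alternative ordering, reusing the metrics list from the loop above (an assumption, not part of the original answer): sort primarily by mean correlation, so the feature most entangled with all the others is dropped first:

# Drop the column with the highest mean correlation first, breaking ties by max
metrics.sort(key=lambda x: (x[1], x[0]), reverse=True)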

Mehran
  • 15,593
  • 27
  • 122
  • 221
0

I managed to do it this way; kindly have a try. However, my approach is just for display purposes, as I wanted to capture the result in my report. If you want to drop features, you can pick either column of each pair in the dataframe below, since dropping just one of the two is enough.

corr = df.corr()  # compute the correlation matrix once
row_index = 0
row_name = []
col_name = []
corr_val = []

while row_index < len(corr.index):
    for index, x in enumerate(corr.iloc[row_index, :]):
        if abs(x) >= 0.8 and index != row_index:
            # Skip the mirrored entry of a pair that was already recorded
            if abs(x) in corr_val:
                if (corr.index[row_index] in col_name) and (corr.columns[index] in row_name):
                    continue
            row_name.append(corr.index[row_index])
            col_name.append(corr.columns[index])
            corr_val.append(x)
    row_index += 1

corrDict = {"First Feature (FF)": row_name, "Second Feature (SF)": col_name, "Correlation (FF x SF)": corr_val}
corr_df2 = pd.DataFrame(corrDict)
corr_df2

This is my output: a dataframe listing each First Feature (FF) / Second Feature (SF) pair with its correlation value.

You can choose either First Feature (FF) or Second Feature (SF). To drop highly correlated features from your original dataset:
your_df.drop(corr_df2['First Feature (FF)'].tolist(), axis=1, inplace=True)

Ming Jun Lim
  • 134
  • 2
  • 12
0

You could use the following function; you'll also get the elements sorted:

def correlation(dataset, threshold = 0.3):
  c = dataset.corr().abs()
  s = c.unstack()
  so = s.sort_values(kind="quicksort")
  results = []
  for index, row in so.items():
    if index[0] != index[1] and row > threshold:
      results.append({index: row})
  return results

You could invoke the function, passing the pandas dataset whose correlations you want to find and the threshold, as follows:

highly_correlated_features = correlation(dataset=data_train_val_without_label, threshold=0.35)
highly_correlated_features

It would result in something like this for a dataset with the following columns and a threshold of 0.35:

Input columns:

 0   HighBP                202944 non-null  float64
 1   HighChol              202944 non-null  float64
 2   CholCheck             202944 non-null  float64
 3   BMI                   202944 non-null  float64
 4   Smoker                202944 non-null  float64
 5   Stroke                202944 non-null  float64
 6   HeartDiseaseorAttack  202944 non-null  float64
 7   PhysActivity          202944 non-null  float64
 8   Fruits                202944 non-null  float64
 9   Veggies               202944 non-null  float64
 10  HvyAlcoholConsump     202944 non-null  float64
 11  AnyHealthcare         202944 non-null  float64
 12  NoDocbcCost           202944 non-null  float64
 13  GenHlth               202944 non-null  float64
 14  MentHlth              202944 non-null  float64
 15  PhysHlth              202944 non-null  float64
 16  DiffWalk              202944 non-null  float64
 17  Sex                   202944 non-null  float64
 18  Age                   202944 non-null  float64
 19  Education             202944 non-null  float64
 20  Income                202944 non-null  float64

Output:

[{('Income', 'Education'): 0.38083797089605675},
 {('Education', 'Income'): 0.38083797089605675},
 {('DiffWalk', 'PhysHlth'): 0.38145172573435343},
 {('PhysHlth', 'DiffWalk'): 0.38145172573435343},
 {('DiffWalk', 'GenHlth'): 0.385707943062701},
 {('GenHlth', 'DiffWalk'): 0.385707943062701},
 {('PhysHlth', 'GenHlth'): 0.3907082729122655},
 {('GenHlth', 'PhysHlth'): 0.3907082729122655}]
  • 1
    Please consider providing output of the source code in the answer, so users can correlate it with the problem statement. – Azhar Khan Dec 17 '22 at 09:32
0

You can use statsmodels' variance_inflation_factor to detect multicollinearity in the dataframe.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif(X):
    vif = pd.DataFrame()
    vif['Variables'] = X.columns
    vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    return vif

Here X is the features DataFrame. Columns involved in multicollinearity will have a VIF greater than 10. A column that can be perfectly reproduced by a linear combination of the other columns will have an infinite VIF. So remove columns one by one, until all infinite and high VIF values are gone.
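A sketch of that iterative removal, assuming X is the features DataFrame and 10 is the chosen cutoff:

# Repeatedly drop the column with the highest VIF until all VIFs are <= 10
X_reduced = X.copy()
while True:
    scores = vif(X_reduced)
    worst = scores.sort_values("VIF", ascending=False).iloc[0]
    if worst["VIF"] <= 10:
        break
    X_reduced = X_reduced.drop(columns=[worst["Variables"]])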

0

You can use the following code:

l=[]
corr_matrix = df.corr().abs()

for ci in corr_matrix.columns: 
    for cj in corr_matrix.columns: 
        if (corr_matrix[ci][cj]>0.8 and ci!=cj):
            l.append(ci)
            
l = np.array(l)
to_drop = np.unique(l)
df.drop(to_drop, axis=1, inplace=True)