
I'm having a tough time finding an example of this, but I'd like to somehow use Dask to drop pairwise correlated columns if their correlation is above a 0.99 threshold. I can't use Pandas' correlation function, as my dataset is too large and it eats up my memory in a hurry. What I have now is a slow double for loop that starts with the first column and finds the correlation between it and all the other columns one by one; if it's above 0.99, it drops that second column, then starts over at the new second column, and so on, kind of like the solution found here. Doing this iteratively across all columns is unbearably slow, although it does at least run without hitting memory issues.

I've read the API here, and have seen how to drop columns using Dask here, but I need some assistance getting this figured out. Is there a faster, yet memory-friendly, way of dropping highly correlated columns in a Pandas DataFrame using Dask? I'd like to feed a Pandas DataFrame into the function and have it return a Pandas DataFrame after the correlation dropping is done.

Does anyone have any resources I can check out, or an example of how to do this?

Thanks!

UPDATE: As requested, here is my current correlation-dropping routine, as described above:

print("Checking correlations of all columns...")

cols_to_drop_from_high_corr = []
corr_threshold = 0.99

for j in df.iloc[:,1:]:  # Skip column 0

    try:  # wrap the below in a try/except: dropping a col in the 2nd 'for' loop below changes the
        # original column list, so if a feature is no longer there because it was already dropped, it'll throw an error

        for k in df.iloc[:,1:]:  # Inner loop also skips column 0

            # If comparing the same column to itself, skip it
            if (j == k):  
                continue

            else:
                try: # second try/except mandatory
                    correlation = abs(df[j].corr(df[k]))  # Get the correlation of the first col and second col

                    if correlation > corr_threshold:  # If they are highly correlated...
                        cols_to_drop_from_high_corr.append(k)  # Add the second col to the list to drop at the end of this round
                
                except:
                    continue

        # Once we have compared the first col with all of the other cols...
        if len(cols_to_drop_from_high_corr) > 0:
            df = df.drop(cols_to_drop_from_high_corr, axis=1)  # Drop all the 2nd highly corr'd cols
            cols_to_drop_from_high_corr = [] # Reset the list for next round
            # print("Dropped all cols from most recent round. Continuing...")

    except:  # Now, if the first for loop tries to find a column that's been dropped already, just continue on
        continue

print("Correlation dropping completed.")

UPDATE: Using the solution below, I'm running into a few errors, and due to my limited Dask syntax knowledge I'm hoping to get some insight. I'm running Windows 10, Python 3.6, and the latest version of Dask.

Using the code as-is on my dataset (the dataset at the link says "file not found"), I ran into the first error:

ValueError: Exactly one of npartitions and chunksize must be specified.

So I specified npartitions=2 in from_pandas, and then got this error:

AttributeError: 'Array' object has no attribute 'compute_chunk_sizes'

I tried changing that to .rechunk('auto'), but then got this error:

ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes

My original dataframe has 1275 rows and 3045 columns. The Dask array shape says shape=(nan, 3045). Does this help to diagnose the issue at all?
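
A possible workaround might be to ask Dask for known partition lengths up front rather than calling compute_chunk_sizes on the array afterwards. Below is only a minimal sketch of that idea, and it assumes DataFrame.to_dask_array(lengths=True) is available in the installed Dask version:

import dask.dataframe as dd
import dask.array as da

ddf = dd.from_pandas(df, npartitions=2)  # df is my original pandas DataFrame
cols = ddf.columns[1:]  # skip column 0, same as in my loop above

# lengths=True asks Dask to compute the partition lengths up front,
# so the resulting array has known chunk sizes instead of shape=(nan, ...)
corr = ddf[cols].corr().abs().to_dask_array(lengths=True)
corr = da.triu(corr)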

wildcat89
  • Hi Matt, do you mind sharing this as a [mcve](/help/mcve)? – rpanai Jul 09 '20 at 15:56
  • @rpanai sure, I will upload a reproducible example of what I have in a couple of minutes – wildcat89 Jul 09 '20 at 19:00
  • I have a couple of questions. Do you want to drop both of the highly correlated columns or just one of them? Did you already try to use the same `pandas`-only code (without for loops) on `dask`? – rpanai Jul 09 '20 at 20:21
  • I'm only looking to drop one of the 2 correlated columns, sorry, I should have been more specific. I will check out your answer later tonight and see what I can come up with. Thanks a lot! – wildcat89 Jul 09 '20 at 22:13

1 Answer


I'm not sure if this helps, but maybe it could be a starting point.

Pandas

import pandas as pd
import numpy as np

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"

df = pd.read_csv(url)

# we check correlation for these columns only
cols = df.columns[-8:]

# the columns in this dataset don't have very high
# correlations, so use a lower threshold
corr_threshold = 0.5

corr = df[cols].corr().abs().values

# we take the upper triangular only
corr = np.triu(corr)

# we want high correlation but not diagonal elements
# it returns a bool matrix
out = (corr != 1) & (corr > corr_threshold)

# for every row we want only the True columns
cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()

cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)
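
If you want the pandas-in, pandas-out interface mentioned in the question, a minimal sketch wrapping the steps above into a function could look like the one below. The function name and the k=1 offset (which excludes the diagonal directly instead of filtering out values equal to 1 afterwards) are my own choices, and it assumes the selected columns are numeric:

import pandas as pd
import numpy as np

def drop_correlated_columns(df, cols=None, corr_threshold=0.99):
    # drop the "second" column of every pair whose absolute correlation
    # exceeds corr_threshold, keeping the first column that appears
    cols = df.columns if cols is None else pd.Index(cols)
    corr = df[cols].corr().abs().values
    # k=1 keeps only the strictly upper triangle, so the diagonal is excluded
    out = np.triu(corr, k=1) > corr_threshold
    cols_to_remove = []
    for o in out:
        cols_to_remove += cols[o].to_list()
    return df.drop(list(set(cols_to_remove)), axis=1)

# usage, e.g.: df = drop_correlated_columns(df, cols=df.columns[-8:], corr_threshold=0.5)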

Dask

Here I comment only on the steps that differ from the pandas version.

import dask.dataframe as dd
import dask.array as da

url = "https://raw.githubusercontent.com/dylan-profiler/heatmaps/master/autos.clean.csv"

df = dd.read_csv(url)

cols = df.columns[-8:]

corr_threshold = 0.5

corr = df[cols].corr().abs().values

# with dask the chunk sizes are unknown, so we compute them first
corr = corr.compute_chunk_sizes()

corr = da.triu(corr)

out = (corr != 1) & (corr > corr_threshold)

# dask is lazy
out = out.compute()

cols_to_remove = []
for o in out:
    cols_to_remove += cols[o].to_list()

cols_to_remove = list(set(cols_to_remove))

df = df.drop(cols_to_remove, axis=1)
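
Note that df is still a lazy dask DataFrame at this point. Since the question asks for a pandas DataFrame back at the end, a minimal follow-up would be to materialize the result with compute():

# collect the lazy result into a regular pandas DataFrame
result = df.compute()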

rpanai
  • Check out the errors I'm getting above in my update to the question. Any thoughts? Thanks! – wildcat89 Jul 10 '20 at 16:05
  • From pandas, sorry. I replaced `df = dd.read_csv(url)` with `df = dd.from_pandas(df, npartitions=2)` – wildcat89 Jul 10 '20 at 19:51
  • Do you mind trying to read using `df = dd.read_csv(url)`? – rpanai Jul 10 '20 at 19:57
  • I figured it out. That `from_pandas` doesn't work great, so I had to save my pandas dataframe to a csv first, then use your function, and then I was able to re-save it using `df.to_csv('./TEST_CSV.csv', single_file=True)`. Thanks!! – wildcat89 Jul 10 '20 at 22:07
  • If you can save as parquet, it will be even better. You'll get faster reading times too. – rpanai Jul 10 '20 at 23:36
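
For reference, a minimal sketch of the parquet round trip suggested in the last comment; the path is made up, and it assumes pyarrow or fastparquet is installed:

# write the reduced dask dataframe to parquet and read it back lazily
df.to_parquet('./reduced_parquet')
df = dd.read_parquet('./reduced_parquet')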