Having a tough time finding an example of this, but I'd like to somehow use Dask to drop pairwise correlated columns if their correlation threshold is above 0.99. I CAN'T use Pandas' correlation
function as my dataset is too large, and it eats up my memory in a hurry. What I have now is a slow, double for
loop that starts with the first column, and finds the correlation threshold between it and all the other columns one by one, and if it's above 0.99, drop that 2nd comparative column, then starts at the new second column, and so on and so forth, KIND OF like the solution found here, however this is unbearably slow doing this in an iterative form across all columns, although it's actually possible to run it and not run into memory issues.
I've read the API here, and see how to drop columns using Dask here, but need some assistance in getting this figured out. I'm wondering if there's a faster, yet memory friendly, way of dropping highly correlated columns in a Pandas Dataframe using Dask? I'd like to feed in a Pandas dataframe to the function, and have it return a Pandas dataframe after the correlation dropping is done.
Anyone have any resources I can check out, or have an example of how to do this?
Thanks!
UPDATE As requested, here is my current correlation dropping routine as described above:
print("Checking correlations of all columns...")
cols_to_drop_from_high_corr = []
corr_threshold = 0.99
for j in df.iloc[:,1:]: # Skip column 0
try: # encompass the below in a try/except, cuz dropping a col in the 2nd 'for' loop below will screw with this
# original list, so if a feature is no longer in there from dropping it prior, it'll throw an error
for k in df.iloc[:,1:]: # Start 2nd loop at first column also...
# If comparing the same column to itself, skip it
if (j == k):
continue
else:
try: # second try/except mandatory
correlation = abs(df[j].corr(df[k])) # Get the correlation of the first col and second col
if correlation > corr_threshold: # If they are highly correlated...
cols_to_drop_from_high_corr.append(k) # Add the second col to list for dropping when round is done before next round.")
except:
continue
# Once we have compared the first col with all of the other cols...
if len(cols_to_drop_from_high_corr) > 0:
df = df.drop(cols_to_drop_from_high_corr, axis=1) # Drop all the 2nd highly corr'd cols
cols_to_drop_from_high_corr = [] # Reset the list for next round
# print("Dropped all cols from most recent round. Continuing...")
except: # Now, if the first for loop tries to find a column that's been dropped already, just continue on
continue
print("Correlation dropping completed.")
UPDATE Using the solution below, I'm running into a few errors and due to my limited dask syntax knowledge, I'm hoping to get some insight. Running Windows 10, Python 3.6 and the latest version of dask.
Using the code as is on MY dataset (the dataset in the link says "file not found"), I ran into the first error:
ValueError: Exactly one of npartitions and chunksize must be specified.
So I specify npartitions=2
in the from_pandas
, then get this error:
AttributeError: 'Array' object has no attribute 'compute_chunk_sizes'
I tried changing that to .rechunk('auto')
, but then got error:
ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes
My original dataframe is in the shape of 1275 rows, and 3045 columns. The dask array shape says shape=(nan, 3045). Does this help to diagnose the issue at all?