Given a DataFrame, I want to find all duplicate columns (the column names differ, but the values are identical) and then group them under a representative key in a dictionary. I have a working solution, but it involves a nested for loop, and I suspect there is a more elegant or direct way to do this in pandas. I am using the approach from remove duplicate columns... as part of my current solution. The question find duplicates in list... sounded similar, but it answers a different problem. My original application is to create mask columns for missing data, so that a single mask column can be reused for all columns sharing the same missing-data pattern.
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3, 4], 'col2': [1, 0, 0, 0, 1],
                   'col3': [1, 0, 0, 0, 1], 'col4': [1, 0, 1, 0, 1],
                   'col5': [1, 0, 1, 0, 1], 'col6': [1, 1, 1, 0, 1],
                   'col7': [1, 0, 0, 0, 1]})

# Drop duplicate columns, keeping one representative from each group.
rep_cols = df.T.drop_duplicates().T.columns.tolist()

tmp_dict = {col: [] for col in rep_cols}
for col in rep_cols:
    for c in df.columns:
        if c != col and np.array_equal(df[col].values, df[c].values):
            tmp_dict[col].append(c)
>>> tmp_dict
{'col1': [], 'col2': ['col3', 'col7'], 'col4': ['col5'], 'col6': []}
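For comparison, here is one way to collapse the nested loop into a single pass: group the column names by the tuple of their values, then take the first column of each group as the key. This is a sketch of an alternative, not part of my current solution; the names `groups` and `result` are my own.

```python
from collections import defaultdict

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3, 4], 'col2': [1, 0, 0, 0, 1],
                   'col3': [1, 0, 0, 0, 1], 'col4': [1, 0, 1, 0, 1],
                   'col5': [1, 0, 1, 0, 1], 'col6': [1, 1, 1, 0, 1],
                   'col7': [1, 0, 0, 0, 1]})

# Map each distinct value pattern (as a hashable tuple) to the columns
# that share it; one pass over the columns instead of a nested loop.
groups = defaultdict(list)
for col in df.columns:
    groups[tuple(df[col])].append(col)

# First column of each group is the representative key; the rest are
# its duplicates (an empty list means the column is unique).
result = {cols[0]: cols[1:] for cols in groups.values()}
```

This gives the same dictionary as the nested-loop version, but runs in a single pass over the columns rather than comparing every pair.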