
Given a dataframe, I want to find all duplicate columns (the column names differ, the values are the same) and then group them by key into a dictionary. I have a solution, but it involves a nested for loop and I was thinking there should be a more elegant way to do this directly in pandas. I am using remove duplicate columns... as part of my current solution. This find duplicates in list... sounded similar to my question but answers a different problem. My original application is to create mask columns for missing data, so that a single mask column can serve all columns with the same missing-data pattern.

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3, 4], 'col2': [1, 0, 0, 0, 1], 'col3': [1, 0, 0, 0, 1],
                   'col4': [1, 0, 1, 0, 1], 'col5': [1, 0, 1, 0, 1], 'col6': [1, 1, 1, 0, 1],
                   'col7': [1, 0, 0, 0, 1]})

# keep one representative column per unique value pattern
dup_cols = df.T.drop_duplicates().T.columns.tolist()
tmp_dict = {col: [] for col in dup_cols}
for col in dup_cols:
    check_cols = [c for c in df.columns if c != col]
    for c in check_cols:
        if np.array_equal(df[col].values, df[c].values):
            tmp_dict[col].append(c)

>>> tmp_dict
{'col1': [], 'col2': ['col3', 'col7'], 'col4': ['col5'], 'col6': []}
Zak Keirn
  • If you haven't done so already, I'd suggest taking a look at `df.fillna()` for replacing NaN. You can specify an individual fill value for each column, so maybe there is no need to create masks. – WolfgangK Feb 08 '19 at 00:58
  • @WolfgangK I don't see how that helps with OP's question... – Joe Patten Feb 08 '19 at 01:00
  • I dont quite understand your ultimate goal, but I think you can simplify everything by defining `df1 = df.T` and then you can do `df1.drop_duplicates()` and transpose back in the end. – ALollz Feb 08 '19 at 01:06
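
The transpose-based simplification suggested in the comments can be sketched as follows. Note it is a sketch of the deduplication step only: it yields the frame with duplicate columns dropped, not the grouping dictionary the question asks for:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3, 4], 'col2': [1, 0, 0, 0, 1], 'col3': [1, 0, 0, 0, 1],
                   'col4': [1, 0, 1, 0, 1], 'col5': [1, 0, 1, 0, 1], 'col6': [1, 1, 1, 0, 1],
                   'col7': [1, 0, 0, 0, 1]})

# transpose so duplicate columns become duplicate rows,
# drop the duplicate rows (keeping the first of each pattern),
# then transpose back
deduped = df.T.drop_duplicates().T

print(deduped.columns.tolist())  # ['col1', 'col2', 'col4', 'col6']
```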

1 Answer


You could group by the value columns of the transposed frame (after `reset_index`, the remaining column holds the original column names) and then build the expected result using a dictionary comprehension and extended iterable unpacking:

import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3, 4], 'col2': [1, 0, 0, 0, 1], 'col3': [1, 0, 0, 0, 1], 'col4': [1, 0, 1, 0, 1],
                   'col5': [1, 0, 1, 0, 1], 'col6': [1, 1, 1, 0, 1], 'col7': [1, 0, 0, 0, 1]})

transpose = df.T

# group by every value column; after reset_index the 'index' column
# holds the original column names
columns = list(range(len(df)))

# build the result iterating over the groups of the 'index' column:
# the first name in each group becomes the key, the rest the value
result = {head: tail for _, (head, *tail) in transpose.reset_index().groupby(columns)['index']}

print(result)

Output

{'col1': [], 'col2': ['col3', 'col7'], 'col4': ['col5'], 'col6': []}
Dani Mesejo
  • Could you add some more detail on how the dictionary comprehension works with the extended iterable unpacking? Trying to wrap my head around it. I see need for `*tail` but am confused by the `for _, (head, *tail)` – Zak Keirn Feb 08 '19 at 01:56
  • @ZakKeirn The `... groupby(columns).index` returns a key, value tuple, in your case you only care about the value, hence the _ for the key, meaning that you don't care about that element. – Dani Mesejo Feb 08 '19 at 02:00
  • Key to my understanding of this solution was to print out `list(transpose.reset_index().groupby(columns).index)` and note the structure. – Zak Keirn Feb 08 '19 at 15:21
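
The structure discussed in the comments can be printed directly. This is a sketch reusing the answer's `df` and `transpose`, grouping over every value column; selecting with `['index']` is equivalent to the `.index` attribute access used in the comments:

```python
import pandas as pd

df = pd.DataFrame({'col1': [0, 1, 2, 3, 4], 'col2': [1, 0, 0, 0, 1], 'col3': [1, 0, 0, 0, 1],
                   'col4': [1, 0, 1, 0, 1], 'col5': [1, 0, 1, 0, 1], 'col6': [1, 1, 1, 0, 1],
                   'col7': [1, 0, 0, 0, 1]})

transpose = df.T
columns = list(range(len(df)))

# each group is a (key, Series) pair: the key is the tuple of row values,
# the Series holds the names of the columns that share those values
for key, names in transpose.reset_index().groupby(columns)['index']:
    print(tuple(int(v) for v in key), list(names))

# (0, 1, 2, 3, 4) ['col1']
# (1, 0, 0, 0, 1) ['col2', 'col3', 'col7']
# (1, 0, 1, 0, 1) ['col4', 'col5']
# (1, 1, 1, 0, 1) ['col6']
```

Unpacking each Series as `(head, *tail)` then takes the first column name as the dictionary key and the remaining names as its value.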