Find all duplicate columns in a collection of data frames

Question

Having a collection of data frames, the goal is to identify the duplicated column names and return them as a list.

Example

The input are 3 data frames df1, df2 and df3:

df1 = pd.DataFrame({'a':[1,5], 'b':[3,9], 'e':[0,7]})

   a  b  e
0  1  3  0
1  5  9  7

df2 = pd.DataFrame({'d':[2,3], 'e':[0,7], 'f':[2,1]})

   d  e  f
0  2  0  2
1  3  7  1

df3 = pd.DataFrame({'b':[3,9], 'c':[8,2], 'e':[0,7]})

   b  c  e
0  3  8  0
1  9  2  7

The output is a list [b, e]

Not sure if it's a dupe, but https://stackoverflow.com/questions/9835762/how-do-i-find-the-duplicates-in-a-list-and-create-another-list-with-them this has every answer posted in this thread so far. — user3483203, Jul 02 '18 at 10:22

jpp · Accepted Answer · 2018-07-02T10:28:32.270

pd.Series.duplicated

Since you are using Pandas, you can use pd.Series.duplicated after concatenating column names:

# concatenate column labels
s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)])

# keep all duplicates only, then extract unique names
res = s[s.duplicated(keep=False)].unique()

print(res)
array(['b', 'e'], dtype=object)

pd.Series.value_counts

Alternatively, you can extract a series of counts and identify rows which have a count greater than 1:

s = pd.concat([df.columns.to_series() for df in (df1, df2, df3)]).value_counts()

res = s[s > 1].index

print(res)
Index(['e', 'b'], dtype='object')

collections.Counter

The classic Python solution is to use collections.Counter followed by a list comprehension. Recall that list(df) returns the columns in a dataframe, so we can use this map and itertools.chain to produce an iterable to feed Counter.

from itertools import chain
from collections import Counter

c = Counter(chain.from_iterable(map(list, (df1, df2, df3))))

res = [k for k, v in c.items() if v > 1]

score 0 · Answer 2 · answered Nov 23 '20 at 22:34

here is my code for this problem, for comparing with only two data frames, with out concat them.

def getDuplicateColumns(df1, df2):
    df_compare = pd.DataFrame({'df1':df1.columns.to_list()})
    df_compare["df2"] = ""
    # Iterate over all the columns in dataframe
    for x in range(df1.shape[1]):
        # Select column at xth index.
        col = df1.iloc[:, x]
        # Iterate over all the columns in DataFrame from (x+1)th index till end
        duplicateColumnNames = []
        for y in range(df2.shape[1]):
            # Select column at yth index.
            otherCol = df2.iloc[:, y]
            # Check if two columns at x y index are equal
            if col.equals(otherCol):                
                duplicateColumnNames.append(df2.columns.values[y])
                df_compare.loc[df_compare["df1"]==df1.columns.values[x], "df2"] = str(duplicateColumnNames)

    return df_compare

hello @M-Chen-3 I think I already put some comment in my code. basically you give my code two data frame, it will compare this two data frames and find the duplication columns. — Sway Wu, Dec 02 '20 at 18:55

Find all duplicate columns in a collection of data frames

2 Answers2

pd.Series.duplicated

pd.Series.value_counts

collections.Counter