how to find list of columns with same values in a dataframe in python

Question

i am trying to find list of columns in a data frame with same values in columns. there is a package in R whichAreInDouble, trying implement that in python.

df  =   
a b c d e f g h i   
1 2 3 4 1 2 3 4 5  
2 3 4 5 2 3 4 5 6  
3 4 5 6 3 4 5 6 7

it should give me list of columns with same values like

a, e are equal
b,f are equal 
c,g are equal

`{(col_1, col_2) for col_1 in df.columns for col_2 in df.columns if col_1 != col_2 and df[col_1].equals(df[col_2])}` O(n^2). — Brian, Sep 18 '19 at 17:19
what if there are 1000's of columns? i am working on a huge dataset with 2000 columns. what i thought of is comparing first 10 rows in the two columns and if they match compare next 10 rows. if they dont match move to next column. — Vivek Sthanam, Sep 18 '19 at 17:24

score 4 · Accepted Answer · answered Sep 19 '19 at 02:24

4

Let's try using itertools and combinations:

from itertools import combinations

[(i, j) for i,j in combinations(df, 2) if df[i].equals(df[j])]

Output:

[('a', 'e'), ('b', 'f'), ('c', 'g'), ('d', 'h')]

answered Sep 19 '19 at 02:24

Scott Boston

147,308
15
139
187

Naveen Reddy Marthala · Answer 2 · 2020-11-20T14:38:02.387

Above solution is good. But, it may happen that, two columns basically have same values, but are encoded differently. for example:

  b c d e f
1 1 3 4 1 a
2 3 4 5 2 c 
3 2 5 6 3 b
4 3 4 5 2 c  
5 4 5 6 3 d
6 2 4 5 2 b  
7 4 5 6 3 d

In above example, you could see that column f, after label encoding, will have same values as column b. So, how to catch duplicate columns like these?
Here you go:

from tqdm import tqdm_notebook

# create an empty dataframe with same index as your dataframe(let's call it train_df), which will be filled with factorized version of original data.
train_enc = pd.DataFrame(index=train_df.index)
# now encode all the features 
for col in tqdm_notebook(train_df.columns):
    train_enc[col] = train_df[col].factorize()[0]
# find and print duplicated columns
dup_cols = {}
# start with one feature
for i, c1 in enumerate(tqdm_notebook(train_enc.columns)):
    # compare it all the remaining features
    for c2 in train_enc.columns[i + 1:]:
        # add the entries to above dict, if matches with the column in first loop
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1
# now print dup_cols dictionary would have names of columns as keys that are identical to a column in value.
print(dup_cols)

column names that match with other, when encoded will be printed at stdout.

if you want to drop duplicate columns, you can do:

train_df.drop(columns=dup_cols.keys(), axis=1, inplace=True)

score 2 · Answer 3 · edited Mar 19 '21 at 17:13

2

from itertools import combinations

    cols_to_remove=[]
    for i,j in combinations(chk,2):
        if chk[i].equals(chk[j]):
            cols_to_remove.append(j)
    
    chk=chk.drop(cols_to_remove,axis=1)

edited Mar 19 '21 at 17:13

Hrvoje

13,566
7
90
104

answered Mar 09 '21 at 11:56

waqas ahmed

107
1
4

how to find list of columns with same values in a dataframe in python

3 Answers3