The solution above works, but two columns can hold the same values under different encodings. For example:
   b  c  d  e  f
1  1  3  4  1  a
2  3  4  5  2  c
3  2  5  6  3  b
4  3  4  5  2  c
5  4  5  6  3  d
6  2  4  5  2  b
7  4  5  6  3  d
In the example above, column f will have the same values as column b after label encoding. So how do you catch duplicate columns like these?
Here you go:
import numpy as np
import pandas as pd
from tqdm.auto import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

# Create an empty dataframe with the same index as your dataframe (call it train_df);
# it will be filled with the factorized version of the original data.
train_enc = pd.DataFrame(index=train_df.index)

# Encode every feature as integer codes.
for col in tqdm(train_df.columns):
    train_enc[col] = train_df[col].factorize()[0]

# Maps each duplicated column to the earlier column it matches.
dup_cols = {}

# Start with one feature...
for i, c1 in enumerate(tqdm(train_enc.columns)):
    # ...and compare it with all the remaining features.
    for c2 in train_enc.columns[i + 1:]:
        # Record c2 if it is not recorded yet and its codes match those of c1.
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1

# Keys are the redundant columns; each value is the column that key duplicates.
print(dup_cols)
The names of columns that match another column once encoded are printed to stdout.
If you want to drop the duplicate columns, you can do:
train_df.drop(columns=list(dup_cols), inplace=True)
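
For a quick check, here is a minimal, self-contained sketch that runs the same idea on the example table from the top of this answer. The helper name `find_encoded_duplicates` is made up for illustration, and the progress bars are left out so it only needs pandas. Note that in this toy data columns c, d and e also encode identically, so they get flagged along with f.

```python
import pandas as pd

# The example table from the top of this answer.
train_df = pd.DataFrame(
    {
        "b": [1, 3, 2, 3, 4, 2, 4],
        "c": [3, 4, 5, 4, 5, 4, 5],
        "d": [4, 5, 6, 5, 6, 5, 6],
        "e": [1, 2, 3, 2, 3, 2, 3],
        "f": ["a", "c", "b", "c", "d", "b", "d"],
    },
    index=range(1, 8),
)


def find_encoded_duplicates(df):
    """Hypothetical helper: map each redundant column to the earlier column it duplicates."""
    # Factorize every column into integer codes (order of first appearance).
    enc = pd.DataFrame(
        {col: pd.factorize(df[col])[0] for col in df.columns},
        index=df.index,
    )
    dup_cols = {}
    for i, c1 in enumerate(enc.columns):
        if c1 in dup_cols:
            continue  # c1 is itself a duplicate; its matches were found earlier
        for c2 in enc.columns[i + 1:]:
            if c2 not in dup_cols and enc[c1].equals(enc[c2]):
                dup_cols[c2] = c1
    return dup_cols


dup_cols = find_encoded_duplicates(train_df)
print(dup_cols)  # {'f': 'b', 'd': 'c', 'e': 'c'}

# Drop the redundant columns without modifying train_df in place.
deduped = train_df.drop(columns=list(dup_cols))
print(deduped.columns.tolist())  # ['b', 'c']
```

Because factorize assigns codes in order of first appearance, two columns are reported as duplicates exactly when their values correspond one-to-one, row by row.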