Drop the duplicate of a certain column when there are many duplicate columns

Question

I have a dataframe with several duplicated columns. I would like to drop only the duplicates of the "class" column, keeping a single copy of it, while leaving the other duplicated columns intact. The number of rows should not change.

Dataframe:

import pandas as pd

train = pd.DataFrame({'class': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 1,
  9: 2,
  10: 3,
  11: 4,
  12: 5,
  13: 6,
  14: 7,
  15: 8},
 'class.1': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 1,
  9: 2,
  10: 3,
  11: 4,
  12: 5,
  13: 6,
  14: 7,
  15: 8},
 'class.2': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 1,
  9: 2,
  10: 3,
  11: 4,
  12: 5,
  13: 6,
  14: 7,
  15: 8},
 'x_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'x_feature_1.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'x_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_2.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_1.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296}})

Expected output:

expected = pd.DataFrame({'class': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 1,
  9: 2,
  10: 3,
  11: 4,
  12: 5,
  13: 6,
  14: 7,
  15: 8},
 'x_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'x_feature_1.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'x_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_2.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_1.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296}})

[in]:

train = train.loc[:,~(train["class"].duplicated())]

[out]:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Edit: Added example dataframe and expected output dataframe.

Omit `:` because it filters by columns; `loc` is also not necessary, e.g. `train = train[~train["class"].duplicated()]` – jezrael, Jan 07 '20 at 12:05
@jezrael These solutions shorten the dataframe: it dropped from 896 rows to 8 rows. I think they deleted the rows with duplicate values in "class". – Mine, Jan 07 '20 at 12:13
@jezrael Edited the question to include an example dataframe. – Mine, Jan 07 '20 at 12:32

jezrael · Accepted Answer · 2020-01-07T12:58:13.323

1

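You can build two boolean masks over the column names: one that selects the columns starting with `class`, and one that flags names whose part before the `.` (obtained with `split`) has already appeared. Then keep every column that is either not a `class` column or not a duplicate: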

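# Columns whose name starts with 'class' ('class', 'class.1', 'class.2')
m1 = train.columns.str.startswith('class')
# Columns whose name before the '.' repeats an earlier column
m2 = train.columns.str.split('.').str[0].duplicated()
# Keep a column unless it is both a 'class' column and a repeat
train = train.loc[:, ~m1 | ~m2]
print(train)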
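
As a rough illustration of the same masking idea, here is a minimal self-contained sketch on a small made-up frame. The helper name `drop_suffixed_copies` and the `small` dataframe are hypothetical, not part of the original answer. One difference to be aware of: the sketch compares the part of the name before the `.` for equality with the target, whereas the answer above uses `startswith('class')`, which would also match a column such as `classification`.

```python
import pandas as pd


def drop_suffixed_copies(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Drop columns like '<name>.1', '<name>.2' but keep '<name>' and everything else.

    Hypothetical helper for illustration only.
    """
    # Part of each column name before the first '.', e.g. 'class.1' -> 'class'
    base = df.columns.str.split('.').str[0]
    # True for columns that belong to the target name
    is_target = base == name
    # True for every column whose base name already appeared earlier
    is_repeat = base.duplicated()
    # Drop only columns that are both a target column and a repeat
    return df.loc[:, ~(is_target & is_repeat)]


# Small made-up frame with the same naming pattern as in the question
small = pd.DataFrame({
    'class': [1, 2, 3],
    'class.1': [1, 2, 3],
    'class.2': [1, 2, 3],
    'x_feature_1': [0.1, 0.2, 0.3],
    'x_feature_1.1': [0.1, 0.2, 0.3],
})

result = drop_suffixed_copies(small, 'class')
print(list(result.columns))
# ['class', 'x_feature_1', 'x_feature_1.1']
```

The row count is unchanged because the mask is applied along the column axis only.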

edited Jan 07 '20 at 12:58

answered Jan 07 '20 at 11:59


How is this not a duplicate? – Erfan Jan 07 '20 at 12:04
