
I have a dataframe with several duplicated columns. I want to drop only the duplicates of the "class" column ("class.1" and "class.2"), keeping a single copy of it, while leaving all other duplicated columns intact. The number of rows should not change.

Dataframe:

train = pd.DataFrame({'class': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 1,
  9: 2,
  10: 3,
  11: 4,
  12: 5,
  13: 6,
  14: 7,
  15: 8},
 'class.1': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 1,
  9: 2,
  10: 3,
  11: 4,
  12: 5,
  13: 6,
  14: 7,
  15: 8},
 'class.2': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 1,
  9: 2,
  10: 3,
  11: 4,
  12: 5,
  13: 6,
  14: 7,
  15: 8},
 'x_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'x_feature_1.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'x_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_2.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_1.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296}})

expected:

expected = pd.DataFrame({'class': {0: 1,
  1: 2,
  2: 3,
  3: 4,
  4: 5,
  5: 6,
  6: 7,
  7: 8,
  8: 1,
  9: 2,
  10: 3,
  11: 4,
  12: 5,
  13: 6,
  14: 7,
  15: 8},
 'x_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'x_feature_1.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'x_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'y_feature_2.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_1.1': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296},
 'z_feature_2': {0: -0.30424321,
  1: 1.6273111,
  2: 0.66127653,
  3: 0.0051847840000000004,
  4: 1.2861978,
  5: -0.47925246,
  6: 1.4743277,
  7: 0.30530296,
  8: -0.30424321,
  9: 1.6273111,
  10: 0.66127653,
  11: 0.0051847840000000004,
  12: 1.2861978,
  13: -0.47925246,
  14: 1.4743277,
  15: 0.30530296}})

[in]:

train = train.loc[:,~(train["class"].duplicated())]

[out]:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).

Edit: Added example dataframe and expected output dataframe.

Mine

1 Answer


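Build two boolean masks over the column names: one for columns whose name starts with class, and one for columns whose name before the . (obtained with split) repeats an earlier column. Then keep every column that is either not a class column or not a duplicate: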

m1 = train.columns.str.startswith('class')
m2 = train.columns.str.split('.').str[0].duplicated()
train = train.loc[:, ~m1 | ~m2]
print(train)
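To see what each mask contributes, here is a minimal sketch on a smaller frame. The frame `df` and the names `is_class`, `base`, `is_repeat`, `result` and `kept` are made up for illustration; only the column naming pattern is taken from the question.

```python
import pandas as pd

# Hypothetical small frame that mimics the question's column naming pattern
df = pd.DataFrame({
    'class': [1, 2, 3],
    'class.1': [1, 2, 3],
    'class.2': [1, 2, 3],
    'x_feature_1': [0.1, 0.2, 0.3],
    'x_feature_1.1': [0.1, 0.2, 0.3],
    'y_feature_2': [0.4, 0.5, 0.6],
})

# True for every column whose name starts with 'class'
is_class = df.columns.str.startswith('class')

# Part of each name before the first '.', e.g. 'class.1' -> 'class'
base = df.columns.str.split('.').str[0]

# True for every column whose base name already appeared earlier
is_repeat = base.duplicated()

# Keep a column if it is not a class column, or if it is not a repeat
result = df.loc[:, ~is_class | ~is_repeat]
kept = list(result.columns)
print(kept)
```

Both masks have one entry per column, which is what `.loc[:, mask]` needs. The attempt in the question fails because `train["class"].duplicated()` has one entry per row instead. Note that `startswith('class')` would also match any other column whose name merely begins with those letters, so compare `base == 'class'` instead if such columns can occur.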
jezrael