Remove duplicate column based on a condition in pandas

Question

I have a DataFrame with a duplicated column name, Weather (shown in the picture of the DataFrame). One of the two columns contains NaN values, and that is the one I want to remove. I tried this:

data_cleaned4.drop('Weather', axis=1)

It dropped both columns, as expected. I then tried to pass a condition to the drop method, but it raises an error:

data_cleaned4.drop(data_cleaned4['Weather'].isnull().sum() > 0, axis=1)

Can anyone tell me how to remove this column? Note that the second-to-last column contains the NaN values, not the last one.

https://stackoverflow.com/questions/14984119/python-pandas-remove-duplicate-columns Try this one. — Amit Nikhade, Jan 09 '21 at 05:45
I tried `pandas.read_image` but it just came back as "has no attribute". Could you post the dataframe as code until those slackers implement it? — tdelaney, Jan 09 '21 at 06:25

ggaurav · Answer 1 · 2021-01-09T06:33:35.320

1

A general solution: df.isnull().any(axis=0).values flags the columns that contain any NaN, and df.columns.duplicated(keep=False) marks every occurrence of a duplicated name as True. Combining the two with & flags the columns to drop, and negating the result with ~ gives the columns you want to retain.

General Solution:

df.loc[:, ~((df.isnull().any(axis=0).values) & df.columns.duplicated(keep=False))]

Input

    A   B   C   C   A
0   1   1   1   3.0 NaN
1   1   1   1   2.0 1.0
2   2   3   4   NaN 2.0
3   1   1   1   4.0 1.0

Output

    A   B   C
0   1   1   1
1   1   1   1
2   2   3   4
3   1   1   1
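
For anyone who wants to run this, here is a minimal sketch that rebuilds the sample frame from the Input table above and applies the same mask. The variable names has_nan, is_dup and result are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the Input table; the names repeat on purpose.
df = pd.DataFrame(
    [[1, 1, 1, 3.0, np.nan],
     [1, 1, 1, 2.0, 1.0],
     [2, 3, 4, np.nan, 2.0],
     [1, 1, 1, 4.0, 1.0]],
    columns=["A", "B", "C", "C", "A"],
)

has_nan = df.isnull().any(axis=0).values     # per column: does it contain any NaN?
is_dup = df.columns.duplicated(keep=False)   # per column: is its name repeated?

# Keep everything except columns that are both duplicated and contain NaN.
result = df.loc[:, ~(has_nan & is_dup)]

print(result.columns.tolist())  # ['A', 'B', 'C']
```

Splitting the mask into two named arrays makes it easier to inspect which columns each condition flags before anything is dropped.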

Just for column C:

df.loc[:, ~(df.columns.duplicated(keep=False) & (df.isnull().any(axis=0).values)
            & (df.columns == 'C'))]

Input

    A   B   C   C   A
0   1   1   1   3.0 NaN
1   1   1   1   2.0 1.0
2   2   3   4   NaN 2.0
3   1   1   1   4.0 1.0

Output

    A   B   C   A
0   1   1   1   NaN
1   1   1   1   1.0
2   2   3   4   2.0
3   1   1   1   1.0
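
If you need this for more than one column name, the same mask can be wrapped in a small helper. This is only a sketch: drop_nan_duplicates and its names parameter are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd

def drop_nan_duplicates(df, names=None):
    """Drop columns whose name is duplicated and which contain any NaN.

    names (hypothetical parameter): optional list of column names to
    restrict the drop to; None means every duplicated name is considered.
    """
    mask = df.columns.duplicated(keep=False) & df.isnull().any(axis=0).values
    if names is not None:
        mask &= df.columns.isin(list(names))
    return df.loc[:, ~mask]

# Same sample frame as in the Input table.
df = pd.DataFrame(
    [[1, 1, 1, 3.0, np.nan],
     [1, 1, 1, 2.0, 1.0],
     [2, 3, 4, np.nan, 2.0],
     [1, 1, 1, 4.0, 1.0]],
    columns=["A", "B", "C", "C", "A"],
)

only_c = drop_nan_duplicates(df, names=["C"])
print(only_c.columns.tolist())                   # ['A', 'B', 'C', 'A']
print(drop_nan_duplicates(df).columns.tolist())  # ['A', 'B', 'C']
```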

edited Jan 09 '21 at 06:33

answered Jan 09 '21 at 06:04

ggaurav


This one has successfully worked. Thanks for the solution. – Jan 11 '21 at 06:05
Welcome! @Malik As it worked for your problem you may consider accepting, upvoting the answer – ggaurav Jan 11 '21 at 06:15
I have already upvoted your answer but it shows votes cast by those with less than 15 reputation are counted but not added as public. Do upvote my Question please it may increase my reputation. – Jan 11 '21 at 19:23
Done. I guess accepting the answer will still work – ggaurav Jan 11 '21 at 19:33

MaxYarmolinsky · Answer 2 · 2021-01-09T05:51:06.437

Because the two columns share a name, dropping by label removes both of them, so drop by position instead. The code below checks the last two columns for NaN, picks the position i to remove, and keeps every other column:

checkone = data_cleaned4.iloc[:, -1].isna().any()  # NaN in the last column?
checktwo = data_cleaned4.iloc[:, -2].isna().any()  # NaN in the second-to-last column?

n = data_cleaned4.shape[1]
if checkone:
    i = n - 1
elif checktwo:
    i = n - 2
else:
    i = n - 2  # neither has NaN: drop the second-to-last by default

# Keep every column except position i (a label-based drop would remove both duplicates).
data_cleaned4 = data_cleaned4.iloc[:, [j for j in range(n) if j != i]]
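
To see the difference between dropping by label and dropping by position, here is a small sketch. The frame is a made-up stand-in (the real data_cleaned4 was only posted as an image), so the Temp column and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_cleaned4; only the duplicated
# 'Weather' name comes from the question.
data_cleaned4 = pd.DataFrame(
    [[20, np.nan, "Sunny"],
     [18, "Rain", "Rain"],
     [25, np.nan, "Cloudy"]],
    columns=["Temp", "Weather", "Weather"],
)

# Dropping by label removes every column with that name.
by_label = data_cleaned4.drop("Weather", axis=1)

# Dropping by position removes only the second-to-last column.
n = data_cleaned4.shape[1]
keep = [j for j in range(n) if j != n - 2]
by_position = data_cleaned4.iloc[:, keep]

print(by_label.columns.tolist())     # ['Temp']
print(by_position.columns.tolist())  # ['Temp', 'Weather']
```

Note that drop returns a new DataFrame, so the result has to be assigned (or inplace=True passed) for the change to stick.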

score 0 · Answer 3 · answered Jan 09 '21 at 05:53

0

Without a testable sample, and assuming you don't have NaNs anywhere else in your DataFrame,

df = df.dropna(axis=1)

should work.
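
A quick sketch of what that assumption means in practice: dropna(axis=1) removes every column that contains a NaN, duplicated name or not. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame: 'C' is duplicated, 'D' is unique but also contains a NaN.
df = pd.DataFrame(
    [[1, 1.0, 3.0, np.nan],
     [1, 1.0, np.nan, 5.0],
     [2, 4.0, 4.0, 6.0]],
    columns=["A", "C", "C", "D"],
)

dropped = df.dropna(axis=1)

# The NaN-bearing 'C' is gone, but so is the unrelated column 'D'.
print(dropped.columns.tolist())  # ['A', 'C']
```

So this is the shortest option when the duplicate is the only column with missing values, and the wrong one otherwise.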

answered Jan 09 '21 at 05:53

Kenan

