Function to drop duplicate columns in pandas dataframe

Asked Dec 05 '15 at 21:17

Active Dec 06 '15 at 05:05

Viewed 342 times

I have a dataframe in pandas with several similar-looking columns (with different names). I'm trying to write a function which compares the data in two columns and drops the second one if they are identical. I've tried this:

import numpy as np
import pandas as pd

def drop_if_ident(df, col1, col2):
    # Drops second column if columns contain identical data
    if (df.shape[0] == np.sum(pd.notnull(df.col1) == pd.notnull(df.col2)):
        df.drop(
            col2,
            axis=1,
            inplace=True
        )

# Usage
drop_if_ident(my_dataframe, my_first_column, my_second_column)

IPython throws the following error:

File "<ipython-input-109-e11b622181bb>", line 3
if (df.shape[0] == np.sum(pd.notnull(df.col1) == pd.notnull(df.col2)):
                                                                     ^
SyntaxError: invalid syntax

...but what is the correct syntax here? Apologies for the noob question :)

user1684046


You are missing a parenthesis: if (df.shape[0] == np.sum(pd.notnull(df.col1) == pd.notnull(df.col2))): – Liam Foley Dec 05 '15 at 21:18
Is the answer given [here](http://stackoverflow.com/a/16939512) not acceptable? – Boa Dec 05 '15 at 21:23
Thanks Liam - can't believe I missed that. – user1684046 Dec 05 '15 at 21:47
Please add an answer and tag this as solved. – gabra Dec 05 '15 at 22:30
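
For reference, a minimal sketch of the function with the parentheses balanced, as the first comment suggests. It also makes two changes that go beyond the syntax fix and are assumptions about what was intended: it uses df[col1] rather than df.col1, because attribute access looks for a column literally named "col1", and it compares the values with Series.equals, because comparing the pd.notnull masks only checks that the missing values line up, not that the data match. The name drop_if_identical and the example frame are hypothetical.

```python
import pandas as pd


def drop_if_identical(df, col1, col2):
    """Drop col2 from df in place if it holds the same data as col1."""
    # Bracket indexing looks the column up by the name stored in the variable;
    # df.col1 would look for a column literally called "col1".
    # Series.equals treats NaNs in the same positions as equal.
    if df[col1].equals(df[col2]):
        df.drop(col2, axis=1, inplace=True)
    return df


# Hypothetical example data
example = pd.DataFrame({
    "a": [1.0, None, 3.0],
    "b": [1.0, None, 3.0],
    "c": [1.0, 2.0, 3.0],
})
drop_if_identical(example, "a", "b")  # "b" duplicates "a", so it is dropped
drop_if_identical(example, "a", "c")  # "c" differs from "a", so it is kept
print(list(example.columns))
```

Note that the column names are passed as strings here. Series.equals also requires matching dtypes, so an integer column and a float column with the same numbers would not count as identical.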
