
I have a dataframe that may or may not have columns in which every value is the same. For example:

    row    A    B
    1      9    0
    2      7    0
    3      5    0
    4      2    0

I'd like to return just

   row    A  
   1      9    
   2      7    
   3      5    
   4      2

Is there a simple way to identify if any of these columns exist and then remove them?

Scott Boston
user1802143

6 Answers


I believe this option will be faster than the other answers here as it will traverse the data frame only once for the comparison and short-circuit if a non-unique value is found.

>>> df

   0  1  2
0  1  9  0
1  2  7  0
2  3  7  0

>>> df.loc[:, (df != df.iloc[0]).any()] 

   0  1
0  1  9
1  2  7
2  3  7
Eric O. Lebigot
chthonicdaemon
  • +1 thanks for changing. This short circuits on the any, after it's already done the != comparison on every element, so DSM's solution will probably be more efficient... wonder if better short circuiting solution. – Andy Hayden Nov 27 '13 at 07:09
  • In my tests, my solution is always faster than counting the unique elements, although the factor varies from 0.1 for a 10×10 DataFrame to around 0.5 for 10000×10. I think the memory you save by not calculating the full equality array trades off against the extra time involved in counting all the unique values (and maintaining a table of values already seen and so on). – chthonicdaemon Nov 27 '13 at 07:24
  • Good point, I take back the "more efficient"! Still wondering if there's a way to short-circuit the `!=` after the first difference it sees. – Andy Hayden Nov 27 '13 at 07:33
  • Note that a column with NaNs will not be considered constant. This is technically correct (because NaN ≠ NaN), but it is probably not what we want (since there is no practical difference between one NaN and another). – Eric O. Lebigot Sep 15 '17 at 16:47
  • @EOL luckily we have a simple way to get rid of all-NaN columns (`.dropna`) – chthonicdaemon Sep 16 '17 at 12:56
  • Indeed: `df.dropna(axis=1, how="all")`. – Eric O. Lebigot Sep 17 '17 at 06:41
  • I have a column that is a timestamp and I get `TypeError: int() argument must be a string, a bytes-like object or a number, not 'Timestamp'`; I don't understand why. – DdD Oct 30 '17 at 12:39
  • It appears that the columns must all be of the same type (or at least all numeric or all timestamp). You'll have to apply the method on the different columns separately unfortunately. – chthonicdaemon Oct 31 '17 at 12:07
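Building on the comment above, a minimal sketch (my own illustration, not from the thread) of applying the first-row comparison one column at a time, so that mixed dtypes such as Timestamps are never compared against other columns' values:

```python
import pandas as pd

# Sketch: run the first-row comparison per column so that mixed dtypes
# (e.g. a Timestamp column next to int columns) never collide.
df = pd.DataFrame({
    "a": [1, 2, 3],                           # varying ints
    "b": pd.to_datetime(["2017-01-01"] * 3),  # constant timestamps
    "c": [0, 0, 0],                           # constant ints
})

keep = [col for col in df.columns if (df[col] != df[col].iloc[0]).any()]
result = df[keep]
print(list(result.columns))  # only the varying column "a" survives
```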

Ignoring NaNs like usual, a column is constant if nunique() == 1. So:

>>> df
   A  B  row
0  9  0    1
1  7  0    2
2  5  0    3
3  2  0    4
>>> df = df.loc[:,df.apply(pd.Series.nunique) != 1]
>>> df
   A  row
0  9    1
1  7    2
2  5    3
3  2    4
DSM
  • `df.apply(pd.Series.nunique)` is more simply `df.nunique()`, in Pandas 0.20.3 at least. – Eric O. Lebigot Sep 17 '17 at 06:46
  • And if we want NaN to be considered as a unique value, `df.nunique(dropna=False)` works well (it handles the fact that NaN ≠ NaN as we expect, counting all NaN values as the same value even though they are not equal). – Eric O. Lebigot Sep 17 '17 at 06:48
  • Another alternative using [`nunique`](https://pandas.pydata.org/docs/reference/api/pandas.Series.nunique.html): `df[df.columns[df.nunique() > 1]]` – rachwa Jun 10 '22 at 18:18
  • @EricOLebigot Subtle but helpful point about the inequality and uniqueness of NaNs! – jtlz2 Oct 12 '22 at 12:30
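To illustrate the `dropna=False` point, a small example of my own (not from the answer): with `dropna=False`, an all-NaN column counts as having one unique value, so it is dropped along with the other constant columns:

```python
import numpy as np
import pandas as pd

# With dropna=False, all NaN values in a column count as a single
# (shared) value, so an all-NaN column is treated as constant.
df = pd.DataFrame({
    "A": [9, 7, 5, 2],
    "B": [0, 0, 0, 0],
    "C": [np.nan] * 4,
})

out = df.loc[:, df.nunique(dropna=False) != 1]
print(list(out.columns))  # ["A"]
```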

I compared various methods on a DataFrame of size 120×10000, and found the most efficient one to be:

def drop_constant_column(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    return dataframe.loc[:, (dataframe != dataframe.iloc[0]).any()]

1 loop, best of 3: 237 ms per loop

The other contenders are

def drop_constant_columns(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    result = dataframe.copy()
    for column in dataframe.columns:
        if len(dataframe[column].unique()) == 1:
            result = result.drop(column,axis=1)
    return result

1 loop, best of 3: 19.2 s per loop

def drop_constant_columns_2(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    for column in dataframe.columns:
        if len(dataframe[column].unique()) == 1:
            dataframe.drop(column,inplace=True,axis=1)
    return dataframe

1 loop, best of 3: 317 ms per loop

def drop_constant_columns_3(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    keep_columns = [col for col in dataframe.columns if len(dataframe[col].unique()) > 1]
    return dataframe[keep_columns].copy()

1 loop, best of 3: 358 ms per loop

def drop_constant_columns_4(dataframe):
    """
    Drops constant value columns of pandas dataframe.
    """
    keep_columns = dataframe.columns[dataframe.nunique()>1]
    return dataframe.loc[:,keep_columns].copy()

1 loop, best of 3: 1.8 s per loop

Yantraguru
  • Using len(df.col.unique()) is very expensive. A simple df.col.nunique() will give the same result with significantly less overhead. – Yash Nag Apr 24 '19 at 10:17
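For reference, here is a hypothetical harness (the exact benchmark setup isn't shown in the answer) that times the winning function on a DataFrame of comparable shape, with every other column made constant:

```python
import numpy as np
import pandas as pd
from timeit import timeit

# Build a 120 x 10000 frame where half the columns are constant.
rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=(120, 10000))
data[:, ::2] = 7  # force every even-indexed column to be constant
df = pd.DataFrame(data)

def drop_constant_column(dataframe):
    """Drops constant value columns of pandas dataframe."""
    return dataframe.loc[:, (dataframe != dataframe.iloc[0]).any()]

print(timeit(lambda: drop_constant_column(df), number=3))  # seconds for 3 runs
```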

Assuming that the DataFrame is entirely numeric, you can try:

>>> df = df.loc[:, df.var() != 0.0]

which will remove constant (i.e. zero-variance) columns.

If the DataFrame contains both numeric and object columns, then you should try:

>>> enum_df = df.select_dtypes(include=['object'])
>>> num_df = df.select_dtypes(exclude=['object'])
>>> num_df = num_df.loc[:, num_df.var() != 0.0]
>>> df = pd.concat([num_df, enum_df], axis=1)

which will drop constant columns of numeric type only.

If you also want to ignore/delete constant enum (object) columns, you should try:

>>> enum_df = df.select_dtypes(include=['object'])
>>> num_df = df.select_dtypes(exclude=['object'])
>>> enum_df = enum_df.loc[:, enum_df.nunique(dropna=False) != 1]
>>> num_df = num_df.loc[:, num_df.var() != 0.0]
>>> df = pd.concat([num_df, enum_df], axis=1)
Hng
  • The comparison should keep the columns with nonzero variance, i.e. `df = df.loc[:, df.var() != 0.0]`; selecting `df.var() == 0.0` keeps only the constant columns. It's probably also worth doing `np.isclose(0, df.var())` to guard against floating point errors. – jeremycg Mar 05 '18 at 18:54
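Putting the comment's suggestions together, a sketch of my own (using `np.isclose` as suggested, not code from the answer) that keeps only columns whose variance is not close to zero:

```python
import numpy as np
import pandas as pd

# Keep columns whose variance is not (approximately) zero; np.isclose
# guards against tiny nonzero variances caused by floating point error.
df = pd.DataFrame({
    "A": [9.0, 7.0, 5.0, 2.0],  # varying column
    "B": [0.1, 0.1, 0.1, 0.1],  # constant float column
})

out = df.loc[:, ~np.isclose(df.var(), 0.0)]
print(list(out.columns))  # ["A"]
```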

Here is my solution, since I needed to handle both object and numerical columns. I'm not claiming it's super efficient or anything, but it gets the job done.

def drop_constants(df):
    """iterate through columns and remove columns with constant values (all same)"""
    columns = df.columns.values
    for col in columns:
        # drop col if unique values is 1
        if df[col].nunique(dropna=False) == 1:
            del df[col]
    return df

One extra caveat: it won't work on columns of lists or arrays, since those are not hashable.

dreyco676
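One hypothetical workaround for such unhashable columns (my own sketch, not part of the answer): compare string representations instead, at the cost of treating distinct objects with equal string forms as equal:

```python
import pandas as pd

# nunique() hashes values, so columns of lists raise TypeError;
# casting to str first makes the values hashable again.
df = pd.DataFrame({
    "A": [1, 2, 3],
    "B": [[0], [0], [0]],  # constant column of (unhashable) lists
})

keep = [col for col in df.columns
        if df[col].astype(str).nunique(dropna=False) > 1]
print(keep)  # ["A"]
```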

Many examples in this thread do not work properly. See my answer for a collection of examples that do work.

vasili111