40

Is there a faster way to drop columns that only contain one distinct value than the code below?

cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)

This is really quite slow for large dataframes. Logically, this counts the number of values in each column when in fact it could just stop counting after reaching 2 different values.

Alexis Eggermont
  • 7,665
  • 24
  • 60
  • 93

13 Answers13

38

You can use Series.unique() method to find out all the unique elements in a column, and for columns whose .unique() returns only 1 element, you can drop that. Example -

for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)

A method that does not do inplace dropping -

res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)

Demo -

In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])

In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:

In [156]: df
Out[156]:
   1
0  2
1  3
2  2

Timing results -

In [166]: %paste
def func1(df):
        res = df
        for col in df.columns:
                if len(df[col].unique()) == 1:
                        res = res.drop(col,axis=1)
        return res

## -- End pasted text --

In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})

In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop

In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop

In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop

The fastest method still seems to be the method using unique and looping through the columns.

Anand S Kumar
  • 88,551
  • 18
  • 188
  • 176
  • 2
    I would suggest to use .nunique() as described in https://stackoverflow.com/a/54405767/940974 – Nicola Jan 14 '20 at 16:06
18

One step:

df = df[[c for c
        in list(df)
        if len(df[c].unique()) > 1]]

Two steps:

Create a list of column names that have more than 1 distinct value.

keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]

Drop the columns that are not in 'keep'

df = df[keep]

Note: this step can also be done using a list of columns to drop:

drop_cols = [c for c
             in list(df)
             if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)
Community
  • 1
  • 1
kait
  • 1,327
  • 9
  • 13
  • Is there a way to see "Two step" example, but instead of storing to list variables to keep store variables to drop. It will be useful to get list of variables that are droped. – vasili111 Apr 14 '20 at 00:10
14
df.loc[:,df.apply(pd.Series.nunique) != 1]

For example

In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]

Out:
   A
0  10
1  20
2  NaN
3  30
jz0410
  • 151
  • 1
  • 3
  • Incorrectly handles in my example `var4` and `var6`: ``` import pandas as pd import numpy as np data = {'var1': [1,2,3,4,5,np.nan,7,8,9], 'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'], 'var3':[101,101,101,102,102,102,103,103,np.nan], 'var4':[np.nan,1,1,1,1,1,1,1,1], 'var5':[1,1,1,1,1,1,1,1,1], 'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'var7': [1,2,3,4,5,6,7,8,9]} df = pd.DataFrame(data) df5 = df.loc[:,df.apply(pd.Series.nunique) != 1]``` – vasili111 Apr 14 '20 at 00:23
  • @vasili111 Because `np.nan !=np.nan` is True . – Mithril May 06 '20 at 08:49
11

Two simple one-liners for either returning a view (shorter version of jz0410's answer)

df.loc[:,df.nunique()!=1]

or dropping inplace (via drop())

df.drop(columns=df.columns[df.nunique()==1], inplace=True)
Ben JW
  • 1,370
  • 1
  • 10
  • 11
2

You can create a mask of your df by calling apply and call value_counts, this will produce NaN for all rows except one, you can then call dropna column-wise and pass param thresh=2 so that there must be 2 or more non-NaN values:

In [329]:   
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df

Out[329]:
   a  b  c
0  1  0  0
1  1  1  0
2  1  2  2
3  1  3  2
4  1  4  2

In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]

Out[342]:
   b  c
0  0  0
1  1  0
2  2  2
3  3  2
4  4  2

Output from the boolean conditions:

In [344]:
df.apply(pd.Series.value_counts)

Out[344]:
    a  b   c
0 NaN  1   2
1   5  1 NaN
2 NaN  1   3
3 NaN  1 NaN
4 NaN  1 NaN

In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)

Out[345]:
   b   c
0  1   2
1  1 NaN
2  1   3
3  1 NaN
4  1 NaN
EdChum
  • 376,765
  • 198
  • 813
  • 562
2

I would like to throw in: pandas 1.0.3

ids = df.nunique().values>1
df.loc[:,ids]

not that slow:

2.81 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
smurfit89
  • 327
  • 5
  • 17
1

Many examples in thread and this thread does not worked for my df. Those worked:

# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column

import pandas as pd
import numpy as np


data = {'var1': [1,2,3,4,5,np.nan,7,8,9],
       'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'],
       'var3':[101,101,101,102,102,102,103,103,np.nan], 
       'var4':[np.nan,1,1,1,1,1,1,1,1],
       'var5':[1,1,1,1,1,1,1,1,1],
       'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
       'var7':["a","a","a","a","a","a","a","a","a"],
       'var8': [1,2,3,4,5,6,7,8,9]}


df = pd.DataFrame(data)
df_original = df.copy()



#-------------------------------------------------------------------------------------------------


df2 = df[[c for c
        in list(df)
        if len(df[c].unique()) > 1]]


#-------------------------------------------------------------------------------------------------


keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]

df3 = df[keep]



#-------------------------------------------------------------------------------------------------



keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]

df5 = df[keep_columns].copy()



#-------------------------------------------------------------------------------------------------



for col in df.columns:
     if len(df[col].unique()) == 1:
         df.drop(col,inplace=True,axis=1)
vasili111
  • 6,032
  • 10
  • 50
  • 80
  • Why didn't the others work? What was different about your use case? – jtlz2 Oct 12 '22 at 12:28
  • 1
    @jtlz2 I do not exactly remember what was the problem (it was quite a time ago) but looking now on my post I think it may be was the handling of the missing values in data frame. – vasili111 Oct 12 '22 at 17:32
1
df=df.loc[:,df.nunique()!=Numberofvalues]
Suraj Rao
  • 29,388
  • 11
  • 94
  • 103
waqas ahmed
  • 107
  • 1
  • 4
0

None of the solutions worked in my use-case because I got this error: (my dataframe contains list item).

TypeError: unhashable type: 'list'

The solution that worked for me is this:

ndf = df.describe(include="all").T
new_cols = set(df.columns) - set(ndf[ndf.unique == 1].index)
df = df[list(new_cols)]  
shantanuo
  • 31,689
  • 78
  • 245
  • 403
  • Does not works with my example:`data = {'var1': [1,2,3,4,5,np.nan,7,8,9], 'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'], 'var3':[101,101,101,102,102,102,103,103,np.nan], 'var4':[np.nan,1,1,1,1,1,1,1,1], 'var5':[1,1,1,1,1,1,1,1,1], 'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'var7': [1,2,3,4,5,6,7,8,9]} df = pd.DataFrame(data) df2 = df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]` – vasili111 Apr 14 '20 at 00:30
0

One line

df=df[[i for i in df if len(set(df[i]))>1]]
Paul Roub
  • 36,322
  • 27
  • 84
  • 93
0

One of the solutions with pipe (convenient if used often):

def drop_unique_value_col(df):
    return df.loc[:,df.apply(pd.Series.nunique) != 1]

df.pipe(drop_unique_value_col)
Adrien Pacifico
  • 1,649
  • 1
  • 15
  • 33
0

This will drop all the columns with only one distinct value.

for col in Dataframe.columns:
    
    if len(Dataframe[col].value_counts()) == 1:

        Dataframe.drop([col], axis=1, inplace=True)
Adeel Afzal
  • 191
  • 1
  • 4
-1

Most 'pythonic' way of doing it I could find:

df = df.loc[:, (df != df.iloc[0]).any()]
amalik2205
  • 3,962
  • 1
  • 15
  • 21
  • 1
    While this may answer the question, please explain *how* your solution solves the problem to make the answer more valuable for future readers. – dspencer Apr 09 '20 at 03:07
  • Does not works with my example: ` import pandas as pd import numpy as np data = {'var1': [1,2,3,4,5,np.nan,7,8,9], 'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'], 'var3':[101,101,101,102,102,102,103,103,np.nan], 'var4':[np.nan,1,1,1,1,1,1,1,1], 'var5':[1,1,1,1,1,1,1,1,1], 'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'var7': [1,2,3,4,5,6,7,8,9]} df = pd.DataFrame(data) df2 = df.loc[:, (df != df.iloc[0]).any()]` – vasili111 Apr 14 '20 at 00:31