quickly drop dataframe columns with only one distinct value

Question

Is there a faster way to drop columns that only contain one distinct value than the code below?

cols=df.columns.tolist()
for col in cols:
    if len(set(df[col].tolist()))<2:
        df=df.drop(col, axis=1)

This is really quite slow for large dataframes. Logically, this counts the number of values in each column when in fact it could just stop counting after reaching 2 different values.

Strongly related to: https://stackoverflow.com/q/20209600/671013 — Dror, Jul 04 '17 at 08:09
Related question: https://stackoverflow.com/questions/54405704/check-if-all-values-in-dataframe-column-are-the-same-pandas — Nicola, Jan 14 '20 at 16:08

Anand S Kumar · Accepted Answer · 2015-10-15T09:48:40.377

You can use Series.unique() method to find out all the unique elements in a column, and for columns whose .unique() returns only 1 element, you can drop that. Example -

for col in df.columns:
    if len(df[col].unique()) == 1:
        df.drop(col,inplace=True,axis=1)

A method that does not do inplace dropping -

res = df
for col in df.columns:
    if len(df[col].unique()) == 1:
        res = res.drop(col,axis=1)

Demo -

In [154]: df = pd.DataFrame([[1,2,3],[1,3,3],[1,2,3]])

In [155]: for col in df.columns:
   .....:     if len(df[col].unique()) == 1:
   .....:         df.drop(col,inplace=True,axis=1)
   .....:

In [156]: df
Out[156]:
   1
0  2
1  3
2  2

Timing results -

In [166]: %paste
def func1(df):
        res = df
        for col in df.columns:
                if len(df[col].unique()) == 1:
                        res = res.drop(col,axis=1)
        return res

## -- End pasted text --

In [172]: df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})

In [178]: %timeit func1(df)
1000 loops, best of 3: 1.05 ms per loop

In [180]: %timeit df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]
100 loops, best of 3: 8.81 ms per loop

In [181]: %timeit df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)
100 loops, best of 3: 5.81 ms per loop

The fastest method still seems to be the method using unique and looping through the columns.

I would suggest to use .nunique() as described in https://stackoverflow.com/a/54405767/940974 — Nicola, Jan 14 '20 at 16:06

score 18 · Answer 2 · edited Jun 20 '20 at 09:12

18

One step:

df = df[[c for c
        in list(df)
        if len(df[c].unique()) > 1]]

Two steps:

Create a list of column names that have more than 1 distinct value.

keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]

Drop the columns that are not in 'keep'

df = df[keep]

Note: this step can also be done using a list of columns to drop:

drop_cols = [c for c
             in list(df)
             if df[c].nunique() <= 1]
df = df.drop(columns=drop_cols)

edited Jun 20 '20 at 09:12

Community

1
1

answered Oct 25 '15 at 20:37

kait

1,327
9
13

Is there a way to see "Two step" example, but instead of storing to list variables to keep store variables to drop. It will be useful to get list of variables that are droped. – vasili111 Apr 14 '20 at 00:10

jz0410 · Answer 3 · 2018-05-11T10:27:29.087

14

df.loc[:,df.apply(pd.Series.nunique) != 1]

For example

In:
df = pd.DataFrame({'A': [10, 20, np.nan, 30], 'B': [10, np.nan, 10, 10]})
df.loc[:,df.apply(pd.Series.nunique) != 1]

Out:
   A
0  10
1  20
2  NaN
3  30

edited May 11 '18 at 10:27

answered May 11 '18 at 10:15

jz0410

151
1
3

Incorrectly handles in my example `var4` and `var6`: ``` import pandas as pd import numpy as np data = {'var1': [1,2,3,4,5,np.nan,7,8,9], 'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'], 'var3':[101,101,101,102,102,102,103,103,np.nan], 'var4':[np.nan,1,1,1,1,1,1,1,1], 'var5':[1,1,1,1,1,1,1,1,1], 'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'var7': [1,2,3,4,5,6,7,8,9]} df = pd.DataFrame(data) df5 = df.loc[:,df.apply(pd.Series.nunique) != 1]``` – vasili111 Apr 14 '20 at 00:23
@vasili111 Because `np.nan !=np.nan` is True . – Mithril May 06 '20 at 08:49

score 11 · Answer 4 · answered Jun 10 '20 at 08:12

11

Two simple one-liners for either returning a view (shorter version of jz0410's answer)

df.loc[:,df.nunique()!=1]

or dropping inplace (via drop())

df.drop(columns=df.columns[df.nunique()==1], inplace=True)

answered Jun 10 '20 at 08:12

Ben JW

1,370
1
10
11

EdChum · Answer 5 · 2015-10-15T09:40:35.293

You can create a mask of your df by calling apply and call value_counts, this will produce NaN for all rows except one, you can then call dropna column-wise and pass param thresh=2 so that there must be 2 or more non-NaN values:

In [329]:   
df = pd.DataFrame({'a':1, 'b':np.arange(5), 'c':[0,0,2,2,2]})
df

Out[329]:
   a  b  c
0  1  0  0
1  1  1  0
2  1  2  2
3  1  3  2
4  1  4  2

In [342]:
df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]

Out[342]:
   b  c
0  0  0
1  1  0
2  2  2
3  3  2
4  4  2

Output from the boolean conditions:

In [344]:
df.apply(pd.Series.value_counts)

Out[344]:
    a  b   c
0 NaN  1   2
1   5  1 NaN
2 NaN  1   3
3 NaN  1 NaN
4 NaN  1 NaN

In [345]:
df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1)

Out[345]:
   b   c
0  1   2
1  1 NaN
2  1   3
3  1 NaN
4  1 NaN

score 2 · Answer 6 · answered Jul 05 '20 at 19:52

2

I would like to throw in: pandas 1.0.3

ids = df.nunique().values>1
df.loc[:,ids]

not that slow:

2.81 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

answered Jul 05 '20 at 19:52

smurfit89

327
5
17

vasili111 · Answer 7 · 2020-04-14T01:42:38.040

Many examples in thread and this thread does not worked for my df. Those worked:

# from: https://stackoverflow.com/questions/33144813/quickly-drop-dataframe-columns-with-only-one-distinct-value
# from: https://stackoverflow.com/questions/20209600/pandas-dataframe-remove-constant-column

import pandas as pd
import numpy as np


data = {'var1': [1,2,3,4,5,np.nan,7,8,9],
       'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'],
       'var3':[101,101,101,102,102,102,103,103,np.nan], 
       'var4':[np.nan,1,1,1,1,1,1,1,1],
       'var5':[1,1,1,1,1,1,1,1,1],
       'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
       'var7':["a","a","a","a","a","a","a","a","a"],
       'var8': [1,2,3,4,5,6,7,8,9]}


df = pd.DataFrame(data)
df_original = df.copy()



#-------------------------------------------------------------------------------------------------


df2 = df[[c for c
        in list(df)
        if len(df[c].unique()) > 1]]


#-------------------------------------------------------------------------------------------------


keep = [c for c
        in list(df)
        if len(df[c].unique()) > 1]

df3 = df[keep]



#-------------------------------------------------------------------------------------------------



keep_columns = [col for col in df.columns if len(df[col].unique()) > 1]

df5 = df[keep_columns].copy()



#-------------------------------------------------------------------------------------------------



for col in df.columns:
     if len(df[col].unique()) == 1:
         df.drop(col,inplace=True,axis=1)

Why didn't the others work? What was different about your use case? — jtlz2, Oct 12 '22 at 12:28
@jtlz2 I do not exactly remember what was the problem (it was quite a time ago) but looking now on my post I think it may be was the handling of the missing values in data frame. — vasili111, Oct 12 '22 at 17:32

score 1 · Answer 8 · edited Mar 08 '21 at 13:00

1

df=df.loc[:,df.nunique()!=Numberofvalues]

edited Mar 08 '21 at 13:00

Suraj Rao

29,388
11
94
103

answered Mar 08 '21 at 12:56

waqas ahmed

107
1
4

score 0 · Answer 9 · answered Nov 15 '19 at 04:13

0

None of the solutions worked in my use-case because I got this error: (my dataframe contains list item).

TypeError: unhashable type: 'list'

The solution that worked for me is this:

ndf = df.describe(include="all").T
new_cols = set(df.columns) - set(ndf[ndf.unique == 1].index)
df = df[list(new_cols)]

answered Nov 15 '19 at 04:13

shantanuo

31,689
78
245
403

Does not works with my example:`data = {'var1': [1,2,3,4,5,np.nan,7,8,9], 'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'], 'var3':[101,101,101,102,102,102,103,103,np.nan], 'var4':[np.nan,1,1,1,1,1,1,1,1], 'var5':[1,1,1,1,1,1,1,1,1], 'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'var7': [1,2,3,4,5,6,7,8,9]} df = pd.DataFrame(data) df2 = df[df.apply(pd.Series.value_counts).dropna(thresh=2, axis=1).columns]` – vasili111 Apr 14 '20 at 00:30

score 0 · Answer 10 · edited Sep 01 '20 at 22:03

0

One line

df=df[[i for i in df if len(set(df[i]))>1]]

edited Sep 01 '20 at 22:03

Paul Roub

36,322
27
84
93

answered Sep 01 '20 at 21:22

Elvin Aghammadzada

763
7
21

score 0 · Answer 11 · answered Mar 16 '21 at 22:58

0

One of the solutions with pipe (convenient if used often):

def drop_unique_value_col(df):
    return df.loc[:,df.apply(pd.Series.nunique) != 1]

df.pipe(drop_unique_value_col)

answered Mar 16 '21 at 22:58

Adrien Pacifico

1,649
1
15
33

score 0 · Answer 12 · answered Jun 26 '21 at 11:51

0

This will drop all the columns with only one distinct value.

for col in Dataframe.columns:
    
    if len(Dataframe[col].value_counts()) == 1:

        Dataframe.drop([col], axis=1, inplace=True)

answered Jun 26 '21 at 11:51

Adeel Afzal

191
1
4

Better to not use for loops with pandas:) – Rémy Hosseinkhan Boucher May 27 '22 at 08:18

amalik2205 · Answer 13 · 2020-04-09T22:00:15.063

-1

Most 'pythonic' way of doing it I could find:

df = df.loc[:, (df != df.iloc[0]).any()]

edited Apr 09 '20 at 22:00

answered Apr 08 '20 at 13:06

amalik2205

3,962
1
15
21

1

While this may answer the question, please explain *how* your solution solves the problem to make the answer more valuable for future readers. – dspencer Apr 09 '20 at 03:07
Does not works with my example: ` import pandas as pd import numpy as np data = {'var1': [1,2,3,4,5,np.nan,7,8,9], 'var2':['Order',np.nan,'Inv','Order','Order','Shp','Order', 'Order','Inv'], 'var3':[101,101,101,102,102,102,103,103,np.nan], 'var4':[np.nan,1,1,1,1,1,1,1,1], 'var5':[1,1,1,1,1,1,1,1,1], 'var6':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan], 'var7': [1,2,3,4,5,6,7,8,9]} df = pd.DataFrame(data) df2 = df.loc[:, (df != df.iloc[0]).any()]` – vasili111 Apr 14 '20 at 00:31

quickly drop dataframe columns with only one distinct value

13 Answers13

One step:

Two steps:

Linked