
I want to get the names of the columns that have the same value across all rows.

My data:

   A   B  C  D
0  1  hi  2  a
1  3  hi  2  b
2  4  hi  2  c

Desired output:

['B', 'C']

Code:

import pandas as pd

d = {'A': [1,3,4], 'B': ['hi','hi','hi'], 'C': [2,2,2], 'D': ['a','b','c']}
df = pd.DataFrame(data=d)

I've been playing around with df.columns and .any(), but can't figure out how to do this.

tbienias
  • *"Get all columns that have constant value" / "only have a single value"* – smci May 29 '18 at 10:42
  • Is performance important? Then check timings in my answer. – jezrael May 29 '18 at 10:51
  • In fact there was an existing duplicate [Select pandas dataframe columns which have only one unique value](https://stackoverflow.com/questions/49099163/select-pandas-dataframe-columns-which-have-only-one-unique-value). Should we close that into this? (because this has more answers) @jezrael you answered both! – smci May 29 '18 at 11:29

4 Answers


Use pandas' not-so-well-known built-in `nunique()`:

df.columns[df.nunique() <= 1]
Index(['B', 'C'], dtype='object')

Notes:

  • Use the `nunique(dropna=False)` option if you want NaNs counted as a separate value.
  • It's the cleanest code, but not the fastest. (But in general, code should prioritize clarity and readability.)
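To see how the `dropna` flag interacts with missing values (the point raised in the comments below), here is a small sketch with a made-up frame containing an all-NaN column and a partially-NaN column:

```python
import numpy as np
import pandas as pd

# Illustrative frame (not the question's data): 'C' is constant,
# 'N' is all-NaN, 'M' has one value plus NaNs.
df = pd.DataFrame({'C': [2, 2, 2],
                   'N': [np.nan, np.nan, np.nan],
                   'M': [1.0, np.nan, np.nan]})

# Default dropna=True ignores NaNs: the all-NaN column counts 0 unique
# values (hence the <= 1 test), and 'M' counts as constant too.
print(df.columns[df.nunique() <= 1].tolist())              # ['C', 'N', 'M']

# dropna=False counts NaN as a value of its own: 'M' now has 2 distinct
# values, while 'N' has exactly 1.
print(df.columns[df.nunique(dropna=False) == 1].tolist())  # ['C', 'N']
```

Whether you want `M` reported as "constant" decides which variant to pick.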
smci
  • @Mohamed Thasin ah solution is the fastest for my dataset (30,000 rows, 195 columns), however since I value readability over speed on this one, I go for your `nunique()` solution - thx! – tbienias May 29 '18 at 11:31
  • This answer works only on the provided sample data but does not answer the question. If one of the columns has two rows with the same value, it will be reported with this answer, and the question is about columns with constant values. `pd.Series(data=[1, 2, 2, 3]).is_unique` => `False`, and this is of course the expected behaviour of such a method – Mousa Sep 10 '19 at 14:26
  • @Mousa: The second solution I showed, using `df.nunique()==1` does work. Updated this. Thanks for the comment – smci Sep 10 '19 at 18:00
  • `nunique` with `dropna=True` (the default) reports cols with only `NaN`s as zero, thus `df.nunique()<=1` seems to be the more general solution. – NichtJens Oct 30 '19 at 12:59
  • @NichtJens: excellent, thanks for the refinement! `nunique()` is totally underdocumented and so are recipes involving it... must file some docbugs/enhances on pandas docs. – smci Oct 30 '19 at 18:09
  • The only example that also handles columns with missing values. – vasili111 Apr 14 '20 at 01:37
  • @vasili111: create your own example for that, it's trivial. I showed the working code for it. – smci Apr 14 '20 at 21:31
  • I already created it. I made that comment for other readers: your example is the only one in this thread that properly handles a column filled with missing values. I am not asking for any additional code. – vasili111 Apr 14 '20 at 23:18

Solution 1:

c = [c for c in df.columns if len(set(df[c])) == 1]
print (c)

['B', 'C']

Solution 2:

c = df.columns[df.eq(df.iloc[0]).all()].tolist()
print (c)
['B', 'C']

Explanation for Solution 2:

First compare all rows to the first row with DataFrame.eq...

print (df.eq(df.iloc[0]))
       A     B     C      D
0   True  True  True   True
1  False  True  True  False
2  False  True  True  False

... then check whether all values in each column are True with DataFrame.all...

print (df.eq(df.iloc[0]).all())
A    False
B     True
C     True
D    False
dtype: bool

... finally filter the column names for which the result is True:

print (df.columns[df.eq(df.iloc[0]).all()])
Index(['B', 'C'], dtype='object')

Timings:

np.random.seed(100)
df = pd.DataFrame(np.random.randint(10, size=(1000,100)))

df[np.random.randint(100, size=20)] = 100
print (df)

# Solution 1 (second-fastest):
In [243]: %timeit ([c for c in df.columns if len(set(df[c])) == 1])
3.59 ms ± 43.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Solution 2 (fastest):
In [244]: %timeit df.columns[df.eq(df.iloc[0]).all()].tolist()
1.62 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

#Mohamed Thasin ah solution
In [245]: %timeit ([col for col in df.columns if len(df[col].unique())==1])
6.8 ms ± 352 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#jpp solution
In [246]: %%timeit
     ...: vals = df.apply(set, axis=0)
     ...: res = vals[vals.map(len) == 1].index
     ...: 
5.59 ms ± 64.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#smci solution 1
In [275]: %timeit df.columns[ df.nunique()==1 ]
11 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#smci solution 2
In [276]: %timeit [col for col in df.columns if not df[col].is_unique]
9.25 ms ± 80 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

#smci solution 3
In [277]: %timeit df.columns[ df.apply(lambda col: not col.is_unique) ]
11.1 ms ± 511 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
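One of the comments also suggests dropping down to the NumPy layer and comparing the raw values against the first row; a runnable sketch of that idea on the question's sample frame:

```python
import pandas as pd

d = {'A': [1, 3, 4], 'B': ['hi', 'hi', 'hi'], 'C': [2, 2, 2], 'D': ['a', 'b', 'c']}
df = pd.DataFrame(data=d)

# Compare every row to the first row at the array level, then keep the
# columns where the comparison holds all the way down (axis 0).
arr = df.values
print(df.columns[(arr == arr[0, :]).all(0)].tolist())  # ['B', 'C']
```

Note that `df.values` on a mixed-dtype frame is an `object` array, so this mainly pays off on homogeneous numeric frames like the timing example above.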
jezrael
  • @smci - Thank you. – jezrael May 29 '18 at 10:55
  • I'd beware of claiming `len(set(df[c]))` is in general faster than the native pandas `len(df[col].unique())`, this toy example only has 3 values. Try it on a bigger df. `set(df[c])` could blow out memory if used on a large string or categorical column. – smci May 29 '18 at 10:56
  • @smci - tested in bigger DataFrame, `pd.DataFrame(np.random.randint(10, size=(1000,100)))` – jezrael May 29 '18 at 10:57
  • and len(set(...)) is still faster? can you post the timings? – smci May 29 '18 at 11:00
  • @smci - my timings are with larger dataframe only, not with sample OP data. – jezrael May 29 '18 at 11:01
  • See my answer `df.columns[ df.nunique()==1 ]`. Care to time it? – smci May 29 '18 at 11:05
  • There's also a builtin [Series.is_unique](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.is_unique.html) – smci May 29 '18 at 11:23
  • `df.columns[ df.apply(attrgetter('is_unique')) == False ]` – smci May 29 '18 at 11:45
  • Since the index information isn't really needed, it'd probably be faster to do `df.columns[(df.values == df.values[0, :]).all(0)]`. – miradulo May 29 '18 at 13:38
  • I think this answer should be the accepted one. Please check this comment on the accepted answer: https://stackoverflow.com/questions/50582168/pandas-get-all-columns-that-have-constant-value#comment102170681_50582784 – Mousa Sep 10 '19 at 14:28
  • How do you also show columns that have only missing values? – vasili111 Apr 14 '20 at 01:35
  • @vasili111 - Do you think `df.loc[:, df.isna().all()]` ? – jezrael Apr 14 '20 at 06:53
  • That leaves only columns with missing values. I collected some examples that work fine with different data in this answer: https://stackoverflow.com/a/61199019/1601703 – vasili111 Apr 14 '20 at 16:03

Try this:

print([col for col in df.columns if len(df[col].unique())==1])

Output:

['B', 'C']
Mohamed Thasin ah
  • pandas now has a builtin [`df.nunique()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.unique.html) – smci May 29 '18 at 11:06

You can use `set` with `apply` and then filter the resulting series by length:

vals = df.apply(set, axis=0)
res = vals[vals.map(len) == 1].index

print(res)

Index(['B', 'C'], dtype='object')

Use `res.tolist()` if you need the output as a list.

jpp