1

I want to check a big DataFrame for constant columns and build two lists: the first with the names of columns that contain only zeros, the second with the names of columns that hold a single constant value other than 0.

I found a solution (A in the code) at Link, but I don't understand it. A does what I want, but I don't know how it works or how I can get the lists from it.

import numpy as np
import pandas as pd

data = [[0,1,1],[0,1,2],[0,1,3]]
df = pd.DataFrame(data, columns=['A', 'B', 'C'])

A = df.loc[:, (df != df.iloc[0]).any()]
credenco

2 Answers

3

Use:

m1 = (df == 0).all()
m2 = (df == df.iloc[0]).all()
a  = df.columns[m1].tolist()
b  = df.columns[~m1 & m2].tolist()
print (a)
['A']
print (b)
['B']

Explanation:

First, compare all values with 0:

print (df == 0)
      A      B      C
0  True  False  False
1  True  False  False
2  True  False  False

Then test whether all values in each column are True with DataFrame.all:

print ((df == 0).all())
A     True
B    False
C    False
dtype: bool

Then compare every row with the first row, selected with DataFrame.iloc:

print (df == df.iloc[0])
      A     B      C
0  True  True   True
1  True  True  False
2  True  True  False

And test again with all:

print ((df == df.iloc[0]).all())
A     True
B     True
C    False
dtype: bool

Because the all-zero columns must be excluded, invert the first mask with ~ and chain it to the second mask with & (bitwise AND):

print (~m1 & m2)
A    False
B     True
C    False
dtype: bool
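
The steps above can be wrapped into one small helper. This is only a minimal sketch: the function name `split_constant_columns` is hypothetical (it is not part of pandas), and it assumes the DataFrame has at least one row and no NaN values, since NaN never compares equal to itself and an all-NaN column would not be reported as constant.

```python
import pandas as pd


def split_constant_columns(df):
    """Return (all-zero column names, constant non-zero column names)."""
    # True for columns where every value equals 0
    m1 = (df == 0).all()
    # True for columns where every value equals the value in the first row
    m2 = (df == df.iloc[0]).all()
    # Constant columns, minus the ones that are all zeros
    return df.columns[m1].tolist(), df.columns[~m1 & m2].tolist()


df = pd.DataFrame([[0, 1, 1], [0, 1, 2], [0, 1, 3]], columns=['A', 'B', 'C'])
zero_cols, const_cols = split_constant_columns(df)
print(zero_cols)   # ['A']
print(const_cols)  # ['B']
```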
jezrael
2

This seems like a clean way to do what you want:

m1 = df.eq(0).all()
m2 = df.nunique().eq(1) & ~m1

m1[m1].index, m2[m2].index
# (Index(['A'], dtype='object'), Index(['B'], dtype='object'))

m1 gives you a boolean mask of the columns that contain only zeros:

m1
A     True
B    False
C    False
dtype: bool

m2 gives you all columns with exactly one unique value (i.e. constant columns), excluding the all-zero ones (the second condition reuses the first mask):

m2
A    False
B     True
C    False
dtype: bool

Deriving your lists is trivial from these masks.
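
For completeness, one way to turn the two masks into plain Python lists is sketched below. The variable names `zero_cols` and `const_cols` are illustrative only, and the sample `df` is the one from the question.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 1], [0, 1, 2], [0, 1, 3]], columns=['A', 'B', 'C'])

m1 = df.eq(0).all()              # columns that are all zeros
m2 = df.nunique().eq(1) & ~m1    # constant columns that are not all zeros

# Boolean-index each mask with itself to keep only the True entries,
# then convert the remaining index labels to a list.
zero_cols = m1[m1].index.tolist()
const_cols = m2[m2].index.tolist()

print(zero_cols)   # ['A']
print(const_cols)  # ['B']
```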

cs95
  • Thanks, I already tried it with `nunique()`, but I ran into problems with my DataFrame: `ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().` In this little example it worked fine, but in my case I was not able to find the problem – credenco Feb 06 '20 at 08:51
  • 2
    @credenco Start by doing `df = df.select_dtypes('int')` and try it again. – cs95 Feb 06 '20 at 08:56
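
The thread does not show what caused the `ValueError`, so the following is only a sketch of the suggested filtering step, not a diagnosis. It assumes a DataFrame with a non-numeric column (the `label` column is made up for illustration) and uses `include='number'` rather than `'int'` so that float columns are kept as well.

```python
import pandas as pd

mixed = pd.DataFrame({
    'A': [0, 0, 0],
    'B': [1, 1, 1],
    'C': [1, 2, 3],
    'label': ['x', 'y', 'z'],  # hypothetical non-numeric column
})

# Keep only numeric columns before building the masks
numeric = mixed.select_dtypes(include='number')

m1 = numeric.eq(0).all()
m2 = numeric.nunique().eq(1) & ~m1

zero_cols = m1[m1].index.tolist()
const_cols = m2[m2].index.tolist()
print(zero_cols)   # ['A']
print(const_cols)  # ['B']
```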