
I have a pandas DataFrame with 83 columns and 4000 rows. I intend to use the data for a logistic regression, so I want to narrow the columns down to those with the least missing data.

To do this I was thinking of ranking the columns by how many NaN observations each contains. I tried a few things like:

econ_balance["BG.GSR.NFSV.GD.ZS"].describe()
econ_balance["BG.GSR.NFSV.GD.ZS"].value_counts
econ_balance["BG.GSR.NFSV.GD.ZS"]["NaN"]
econ_balance["BG.GSR.NFSV.GD.ZS"][NaN]

None of these seems to work. I also tried googling to see whether this question had been answered before, but no luck.

Thanks in advance for the help

Josh


1 Answer


If you're looking just to count the NaN values:

In [1]:

import numpy as np
import pandas as pd

In [2]:

df = pd.DataFrame({'a': [0, 1, np.nan, np.nan, np.nan], 'b': np.nan, 'c': [np.nan, 1, 2, 3, np.nan]})
df
Out[2]:
    a   b   c
0   0 NaN NaN
1   1 NaN   1
2 NaN NaN   2
3 NaN NaN   3
4 NaN NaN NaN
In [6]:

df.isnull().astype(int).sum()
Out[6]:
a    3
b    5
c    2
dtype: int64

EDIT: @CTZhu has pointed out that the type casting is unnecessary, since sum() already counts True values as 1:

In [7]:

df.isnull().sum()
Out[7]:
a    3
b    5
c    2
dtype: int64
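
To go from counting to the ranking the question asks for, you can sort the counts. A minimal sketch, assuming a pandas version with sort_values (older versions used Series.order() instead), and with the cutoff of 40 columns as a purely arbitrary example:

# rank the columns by how many NaNs each contains, fewest first
na_counts = econ_balance.isnull().sum().sort_values()

# keep only the columns with the least missing data,
# e.g. the 40 best ones (arbitrary cutoff for illustration)
econ_subset = econ_balance[na_counts.index[:40]]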
EdChum