How to show counts of null string-like column values in pandas

Question

For string-like columns, dataframe.describe() reports the following fields:

count unique top freq first last

While these are certainly useful, it is also very important to know whether any given column contains null values, and how many.

While I could write a custom function to find this, it would be significant additional overhead. Note that there is a related question, but it focuses on numeric columns and is thus not directly applicable: How to count the NaN values in a column in pandas DataFrame. So is there another helper function that can provide this additional information?

What about `df.describe(include="all")`? The count is the count of non-nulls. So if you know the length, you know how many are null. For example: `df.shape[0] - df.describe(include="all").loc["count", :]` – pault, Jul 31 '19 at 17:49
@VivekSolanki Please make that an answer. Oh, hold on: `info()` does _display_ the null counts I need, but I need to capture the counts in a Series or DataFrame, not just have them dumped to stdout. Any suggestions on that? – WestCoastProjects, Jul 31 '19 at 17:52
What do you mean by capturing the counts in a Series or DataFrame? Please elaborate. – Vivek Solanki, Jul 31 '19 at 17:56
`info()` prints to the console. That is not helpful when the goal is to store the summary stats in an in-memory data structure. – WestCoastProjects, Jul 31 '19 at 17:59
@QuangHoang Yes! Please make that an answer and I will accept. – WestCoastProjects, Jul 31 '19 at 18:01
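
pault's suggestion in the comments above can be sketched roughly as follows. The sample data and the variable names `non_null` and `null_counts` are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical sample data (not from the question)
df = pd.DataFrame({
    "name": ["ann", None, "cy"],
    "city": ["x", "y", None],
    "n": [1, 2, 3],
})

# describe(include="all") reports a non-null "count" for every column,
# including the string-like ones
non_null = df.describe(include="all").loc["count", :]

# Subtracting from the number of rows leaves the null count per column,
# returned as a Series indexed by column name
null_counts = df.shape[0] - non_null
```

The result is an in-memory Series rather than printed text, which seems to be what the question is after.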

score 3 · Accepted Answer · answered Jul 31 '19 at 18:03

For a quick look at the number of NaN values in each column:

dataframe.isna().sum()
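
As a minimal sketch of what this returns (the sample frame and the names `null_counts` and `nulls_df` are assumptions for illustration), the result is a Series indexed by column name, and it can be turned into a one-column DataFrame if that is more convenient:

```python
import pandas as pd

# Hypothetical sample data with missing values in string-like columns
df = pd.DataFrame({"b": ["a", None, "c"], "d": [123, "abc", None]})

# isna() marks None/NaN cells as True; sum() counts them per column
null_counts = df.isna().sum()             # Series indexed by column name

# Optional: a one-column DataFrame, handy for joining with other summaries
nulls_df = null_counts.to_frame("Nulls")
```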

Quang Hoang

Actually, this is not just a "quick glance", since a fully structured result is returned. I need that for an in-memory data structure. I will probably join this result to the DataFrame returned by `df.describe()` to give a complete one-place summary – WestCoastProjects Jul 31 '19 at 18:04
I used your result in a fully realized enhanced `describe()` containing the `Nulls` column: see my answer below – WestCoastProjects Jul 31 '19 at 18:30

Vivek Solanki · Answer 2 · 2019-07-31T18:04:54.477

1

You can try: dataframe.info()

As mentioned in the docs, df.info() gives you information about a DataFrame, including the index dtype and column dtypes, non-null values, and memory usage.

Based on your requirement to store the info, you can try the following:

import io

buffer = io.StringIO()
df.info(buf=buffer)  # write the summary into the buffer instead of stdout
s = buffer.getvalue()  # the whole summary as a single string
with open("df_info.txt", "w", encoding="utf-8") as f:
    f.write(s)

Source: df.info() docs

edited Jul 31 '19 at 18:04

answered Jul 31 '19 at 17:55

Vivek Solanki

`info()` prints to the console. While this approach is helpful for casual evaluation, it is less so when the goal is to store the summary stats in an in-memory data structure. I will upvote since it is useful for the former use case, but I am still looking for an approach to capture the NA counts in a data structure. It may be that combining `df.describe()` with `df.count()` will do the trick: I am working on that now. Oh, never mind, @QuangHoang has a slightly better answer – WestCoastProjects Jul 31 '19 at 17:59
Your updated answer just allows storing the result as a string blob. It is not a data structure. – WestCoastProjects Jul 31 '19 at 18:05
Agree. I will leave it as it is if someone in future wants to store it as a string blob. – Vivek Solanki Jul 31 '19 at 18:09

WestCoastProjects · Answer 3 · 2019-07-31T18:34:56.983

Include the Nulls Counts with describe()

The following provides the full realization of my original intent to add the nulls column to the information provided by dataframe.describe(). Credit to @QuangHoang for mentioning the dataframe.isna().sum() that forms part of this answer.

Notice that we have to transpose the output from describe() so that each original column becomes a row. The null counts are then prepended to the transposed describe() output, and the columns are relabeled via set_axis so that the first one is named Nulls:

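import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': [99.5, 11.2, 433.1],
                   'd': [123, 'abc', None]})
desc = df.describe()  # Returns a DataFrame with stats in the row index
# Parentheses let the method chain span lines. The inplace argument of
# set_axis was removed in pandas 2.0; set_axis now always returns a new object.
combo = (pd.concat([df.isna().sum(), desc.T], axis=1)
           .set_axis(['Nulls'] + list(desc.index), axis=1))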
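
Since the question is about string-like columns, a variation worth considering is to pass `include="all"` to describe(), as pault suggested in the comments, so that those columns get their own statistics too. The following is only a sketch under that assumption. The sample data and the names `nulls`, `desc`, and `summary` are illustrative, and naming the Series with `rename` replaces the `set_axis` step:

```python
import pandas as pd

# Hypothetical sample data with a missing value in a string column
df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["a", "b", None],
    "c": [99.5, 11.2, 433.1],
})

# Name the null-count Series so it becomes the "Nulls" column after concat
nulls = df.isna().sum().rename("Nulls")

# include="all" adds count/unique/top/freq for string-like columns;
# transposing gives one row per original column
desc = df.describe(include="all").T

# Align on the column names and put Nulls first
summary = pd.concat([nulls, desc], axis=1)
```

Here `summary` has one row per column of `df`, with `Nulls` followed by the describe() statistics. Statistics that do not apply to a column (for example `mean` for `b`) are left as NaN.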
