1

dataframe.describe() reports the following statistics for string-like columns:

count unique top freq first last


While these are certainly useful, it is also very important to know whether any given column contains null values, and how many.

While I could resort to writing a custom function to find this, it would be significant additional overhead. Note that there is a related question, but it focuses on numeric columns and is thus not directly applicable: How to count the NaN values in a column in pandas DataFrame. So is there another helper function that can provide that additional information?

WestCoastProjects

3 Answers

3

For a quick look at the number of NaN values in each column:

dataframe.isna().sum()
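
As a minimal sketch of what this one-liner returns, the example below builds a small frame with some missing values. The column names and data are hypothetical and chosen only for illustration. The result is a pandas Series indexed by column name, so it can be stored or joined to other summaries.

```python
import pandas as pd

# Hypothetical sample data; the column names are made up for illustration.
df = pd.DataFrame({
    "name": ["ann", None, "cy"],
    "city": ["Oslo", "Rome", None],
    "score": [1.5, None, None],
})

# isna() marks missing cells as True; sum() counts the True values per column.
null_counts = df.isna().sum()
print(null_counts)
print(type(null_counts).__name__)
```

Because the result is a Series rather than printed text, it works for both string-like and numeric columns in the same call.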
Quang Hoang
  • Actually, this is not just a "quick glance", since a fully structured result is returned. I need that for an in-memory data structure. I will probably join this result to the dataframe returned by `df.describe()` to give a complete one-place summary – WestCoastProjects Jul 31 '19 at 18:04
  • I used your result in a fully realized enhanced `describe()` containing the `Nulls` column: see my answer below – WestCoastProjects Jul 31 '19 at 18:30
1

You can try: dataframe.info()

As mentioned in the docs, df.info() gives you information about a DataFrame, including the index dtype and column dtypes, non-null counts, and memory usage.

Based on your requirement to store the info, you can try the following:

import io
buffer = io.StringIO()
df.info(buf=buffer)
s = buffer.getvalue()
with open("df_info.txt", "w", encoding="utf-8") as f:
    f.write(s)

Source: df.info() docs
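
If the non-null counts that info() prints are needed as data rather than as text, one possible route is DataFrame.count(), which returns them as a Series. The sketch below assumes a small hypothetical frame; subtracting the counts from the row count gives the number of nulls per column.

```python
import pandas as pd

# Hypothetical sample frame, used only for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "d": [123, "abc", None]})

non_null = df.count()        # Series: non-null cells per column
nulls = len(df) - non_null   # Series: null cells per column
print(nulls)
```

This gives the same numbers as df.isna().sum(), just derived from the non-null side.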

Vivek Solanki
  • `info()` prints to the console. While this approach is helpful for casual evaluation, it is less so when the goal is to store the summary stats in an in-memory data structure. I will upvote since it is useful for the former use case, but I am still looking for an approach to capture the NA counts in a data structure. It may be that combining `df.describe()` with `df.count()` will do the trick: I am working on that now. Oh, never mind: @QuangHoang has a slightly better answer – WestCoastProjects Jul 31 '19 at 17:59
  • Your updated answer just allows storing the result as a string blob. It is not a data structure. – WestCoastProjects Jul 31 '19 at 18:05
  • Agree. I will leave it as it is if someone in future wants to store it as a string blob. – Vivek Solanki Jul 31 '19 at 18:09
1

Include the Nulls Counts with describe()

The following provides the full realization of my original intent to add the nulls column to the information provided by dataframe.describe(). Credit to @QuangHoang for mentioning the dataframe.isna().sum() that forms part of this answer.

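Notice that we have to transpose the output of describe(), because it puts the statistics in the row index and the original columns along the top. The null counts are then prepended to the transposed describe() output, and the resulting first column is named Nulls via set_axis: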

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': ['a', 'b', 'c'], 'c': [99.5, 11.2, 433.1],
                   'd': [123, 'abc', None]})
desc = df.describe()  # Returns a DataFrame with stats in the row index
# Parentheses let the method chain span two lines; set_axis returns a new frame
combo = (pd.concat([df.isna().sum(), desc.T], axis=1)
           .set_axis(['Nulls'] + list(desc.index), axis=1))
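
Since the question is about string-like columns, note that describe() with default arguments summarizes only the numeric columns of a mixed frame. A possible variant, sketched below, passes include="all" so that object columns get their count, unique, top and freq rows as well. This uses DataFrame.insert instead of concat and set_axis, which is a substitution for illustration and not the approach shown above.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3],
    "b": ["a", "b", "c"],
    "c": [99.5, 11.2, 433.1],
    "d": [123, "abc", None],
})

# include="all" makes describe() cover object columns as well as numeric ones.
summary = df.describe(include="all").T

# Put the null counts in the first column; rows align on the column names.
summary.insert(0, "Nulls", df.isna().sum())
print(summary)
```

Statistics that do not apply to a given column (for example, mean for a string column) show up as NaN in the combined table.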


WestCoastProjects