5

I have a large data frame with 85 columns. The missing data has been coded as NaN. My goal is to get the number of missing values in each column, so I wrote a for loop to build a list of the counts, but it does not work.

Here is my code:

headers = x.columns.values.tolist() 
nans=[]
for head in headers:
    nans_col = x[x.head == 'NaN'].shape[0]
    nan.append(nans_col)

If I use the code from the loop body for one specific column, replacing `head` with that column's name, it works and gives me the amount of missing data in that column.

So I do not know how to correct the for loop. Could somebody kindly help me with this? I highly appreciate your help.

Karl
vivian
  • You've compared the entry to the string `'NaN'`, which is not even the data type you need. Look up the `isnan` function and, in general, how to detect `NaN` values. – Prune Oct 18 '18 at 00:35
  • @Prune Thanks for your comments! I coded missing data as np.nan. Then isnull() works to find missing data. – vivian Oct 18 '18 at 03:55
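As the comments note, the loop has two bugs: `x.head` resolves to the `DataFrame.head` method rather than the column named in `head`, and comparing to the string `'NaN'` never matches a real `np.nan`. A minimal corrected sketch (the small example frame here is an assumption for illustration):

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, np.nan]})  # stand-in for the 85-column frame

headers = x.columns.values.tolist()
nans = []
for head in headers:
    # x[head] selects the column; isnull() detects np.nan values
    nans_col = x[x[head].isnull()].shape[0]
    nans.append(nans_col)

print(nans)  # [1, 2]
```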

5 Answers

10

For the columns of a pandas (Python data analysis library) DataFrame you can use:

In [3]: import numpy as np
In [4]: import pandas as pd
In [5]: df = pd.DataFrame({'a':[1,2,np.nan], 'b':[np.nan,1,np.nan]})
In [6]: df.isnull().sum()
Out[6]:
a    1
b    2
dtype: int64

For a single column or Series you can count the missing values as shown below:

In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: s = pd.Series([1,2,3, np.nan, np.nan])

In [4]: s.isnull().sum()
Out[4]: 2

Reference
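On a wide frame like the 85-column one in the question, a common follow-up is to sort the per-column counts so the worst columns come first; a small sketch (the example frame is an assumption):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': [np.nan, 1, np.nan]})

# Per-column missing counts, largest first; chaining a second .sum()
# instead would give the grand total of missing values
counts = df.isnull().sum().sort_values(ascending=False)
print(counts)
```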

1

This gives you a per-column count of missing values: for each column name it prints the `value_counts` of the null mask, where the `True` rows are the missing ones.

missing_data = df.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print(missing_data[column].value_counts())
    print("")
bbarnes8
1

Just use `DataFrame.info`; the non-null count is probably what you want, and more.

>>> pd.DataFrame({'a':[1,2], 'b':[None, None], 'c':[3, None]}) \
.info(verbose=True, null_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2 entries, 0 to 1
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   a       2 non-null      int64    
 1   b       0 non-null      object
 2   c       1 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 176.0+ bytes
B.Mr.W.
  • if you're getting `'Series' object has no attribute 'info'` for a single column, try this `df['a'].isna().sum()` – PatrickT Nov 14 '21 at 07:39
0

If there are multiple data frames, below is a function to calculate the number of missing values in each column, with percentages:

Missing Data Analysis

import pandas as pd  # needed for pd.DataFrame below

def miss_data(df):
    x = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum() / df[col].shape[0]) * 100
        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data)
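For a self-contained check, here is the same logic repeated so the snippet runs on its own (the example frame is an assumption, and a `return` is added so the result can be inspected as well as printed):

```python
import numpy as np
import pandas as pd

def miss_data(df):
    # One row per column: column name, missing count, percent missing
    cols = ['column_name', 'missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=cols)
    for col in df.columns:
        n_missing = df[col].isnull().sum()
        pct_missing = n_missing / df[col].shape[0] * 100
        missing_data.loc[len(missing_data)] = [col, n_missing, pct_missing]
    print(missing_data)
    return missing_data  # returning makes the result reusable

df = pd.DataFrame({'a': [1, np.nan], 'b': [np.nan, np.nan]})
result = miss_data(df)
```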
Community
GpandaM
  • stumbled across this function, was looking for something like this, not working for me. – Ricky Sep 07 '21 at 10:18
0
# Function to show the total null values per column
colum_name = np.array(data.columns.values)

def iter_columns_name(colum_name):
    for k in colum_name:
        print("total nulls {}=".format(k), pd.isnull(data[k]).values.ravel().sum())

# Call the function
iter_columns_name(colum_name)

# output
total nulls start_date= 0
total nulls end_date= 0
total nulls created_on= 0
total nulls lat= 9925
.
.
.