Find out the percentage of missing values in each column in the given dataset

Question

import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / len(df.index))
print(round(percent,2))

input is https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0

and the output should be

Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.24
Discount               0.65
Order_Quantity         0.65
Profit                 0.65
Shipping_Cost          0.65
Product_Base_Margin    1.30
dtype: float64

You need to iterate over the columns, printing once per column. You haven't written a loop to do that. Where are you stuck? — Prune, Jun 27 '18 at 20:37
Related https://stackoverflow.com/questions/26266362/how-to-count-the-nan-values-in-a-column-in-pandas-dataframe/39734251 — user3483203, Jun 27 '18 at 20:38
Also @Prune iteration is the last thing he should be doing in `pandas` — user3483203, Jun 27 '18 at 20:38
@user3483203 Good point -- should be building a vector of results. — Prune, Jun 27 '18 at 20:40

Engineero · Accepted Answer · 2019-03-18T14:06:57.127

120

How about this? I think I actually found something similar on here once before, but I'm not seeing it now...

percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})

And if you want the missing percentages sorted, follow the above with:

missing_value_df.sort_values('percent_missing', inplace=True)

As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:

percent_missing = df.isnull().sum() * 100 / len(df)

edited Mar 18 '19 at 14:06

answered Jun 27 '18 at 20:37

Engineero

12,340
5
53
75

I don't think you need the first line and the last line. The middle line produces the result he wants – user3483203 Jun 27 '18 at 20:39
I don't know, the output he shows looks like a copied and pasted pandas dataframe itself, thus I'm building a dataframe from the existing one's columns and their percent missing. But we'll see what the OP says. – Engineero Jun 27 '18 at 20:40
its not printing the dtype float 64 at the last – Shaswata Jun 27 '18 at 20:45
@Shaswata yeah, looks like you can get the output from just doing the `df.isnull().sum()...` after all. Added that to my answer as it was pointed out in the comments. – Engineero Jun 27 '18 at 20:49

Scott Boston · Answer 2 · 2018-07-06T19:42:02.627

57

Update let's use mean with isnull:

df.isnull().mean() * 100

Output:

Ord_id                 0.000000
Prod_id                0.000000
Ship_id                0.000000
Cust_id                0.000000
Sales                  0.238124
Discount               0.654840
Order_Quantity         0.654840
Profit                 0.654840
Shipping_Cost          0.654840
Product_Base_Margin    1.297774
dtype: float64

IIUC:

df.isnull().sum() / df.shape[0] * 100.00

Output:

Ord_id                 0.000000
Prod_id                0.000000
Ship_id                0.000000
Cust_id                0.000000
Sales                  0.238124
Discount               0.654840
Order_Quantity         0.654840
Profit                 0.654840
Shipping_Cost          0.654840
Product_Base_Margin    1.297774
dtype: float64

edited Jul 06 '18 at 19:42

answered Jun 27 '18 at 20:41

Scott Boston

147,308
15
139
187

2

How to drop the columns whose missing percentages are more than 30%? – ah bon Dec 04 '19 at 12:41
4

@Ahbon you can use the following: `df.loc[:, df.isnull().mean() < .30]` – Scott Boston Dec 04 '19 at 14:04
@PeJota These are percentages not fractions. You can have up to 100% NaN. This just means that a little more than 1% of this column has NaNs. – Scott Boston Jun 28 '22 at 18:53

bitbang · Answer 3 · 2021-01-07T14:06:48.560

13

single line solution

df.isnull().mean().round(4).mul(100).sort_values(ascending=False)

edited Jan 07 '21 at 14:06

answered Jan 06 '21 at 14:08

bitbang

1,804
14
18

1

You can also turn this into a data frame for further manipulation if required (still a one-liner): ```pd.DataFrame(df.isnull().mean().round(4).mul(100).sort_values(ascending=False), columns=['Percentage Missing'])``` – Ricky McMaster Jan 18 '22 at 09:35

score 7 · Answer 4 · answered Jun 27 '18 at 20:49

7

To cover all missing values and round the results:

((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)

The output:

Out[556]: 
Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.24
Discount               0.65
Order_Quantity         0.65
Profit                 0.65
Shipping_Cost          0.65
Product_Base_Margin    1.30
dtype: float64

answered Jun 27 '18 at 20:49

RomanPerekhrest

88,541
4
65
105

1

`isnull` and `isna` are aliases, as far as I can tell – IanS Jun 28 '18 at 14:57
@IanS, yes, at the moment. But in case if they are interchangeable, one of them could be somehow extended in future versions or one of them could be removed. So, in one case we have an advanced action, in 2nd case we have a potential indicating point showing which function is actual. – RomanPerekhrest Jun 28 '18 at 16:44

score 3 · Answer 5 · answered Jul 03 '19 at 13:00

3

The solution you're looking for is :

round(df.isnull().mean()*100,2)

This will round up the percentage upto 2 decimal places

Another way to do this is

round((df.isnull().sum()*100)/len(df),2)

but this is not efficient as using mean() is.

answered Jul 03 '19 at 13:00

Nitish Arora

31
1

score 2 · Answer 6 · answered Jun 19 '19 at 14:35

import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])


results 

  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0

You can use following function, which will give you output in Dataframe

Zero Values
Missing Values
% of Total Values
Total Zero Missing Values
% Total Zero Missing Values
Data Type

Just copy and paste following function and call it by passing your pandas Dataframe

def missing_zero_values_table(df):
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
        columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        mz_table = mz_table[
            mz_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"      
            "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
#         mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table

missing_zero_values_table(results)

Output

Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.

             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
last_name              0               2               40.0                          2                         40.0    object
Test2_Score            2               2               40.0                          4                         80.0   float64
first_name             0               1               20.0                          1                         20.0    object
age                    0               1               20.0                          1                         20.0   float64
sex                    0               1               20.0                          1                         20.0    object
Test1_Score            3               1               20.0                          4                         80.0   float64

If you want to keep it simple then you can use following function to get missing values in %

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))


missing(results)

Test2_Score    40.0
last_name      40.0
Test1_Score    20.0
sex            20.0
age            20.0
first_name     20.0
dtype: float64

In the top larger table is there a way to have a dataframe of all columns so anyone that has no issues is still listed ? — Tinkinc, Mar 21 '22 at 19:04

score 2 · Answer 7 · answered Nov 21 '22 at 15:38

One-liner

I'm wondering nobody takes advantage of the size and count? It seems the shortest (and probably fastest) way to do it.

df.apply(lambda x: 1-(x.count()/x.size))

Resulting in:

Ord_id                 0.000000
Prod_id                0.000000
Ship_id                0.000000
Cust_id                0.000000
Sales                  0.002381
Discount               0.006548
Order_Quantity         0.006548
Profit                 0.006548
Shipping_Cost          0.006548
Product_Base_Margin    0.012978
dtype: float64

If you find any reason why this is not a good way, please comment

score 1 · Answer 8 · answered Dec 11 '19 at 11:25

If there are multiple dataframe below is the function to calculate number of missing value in each column with percentage

def miss_data(df):
    x = ['column_name','missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100

        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data)

naimur978 · Answer 9 · 2020-05-27T05:54:11.050

By this following code, you can get the corresponding percentage values from every columns. Just switch the name train_data with df, in case of yours.

Input:

In [1]:

all_data_na = (train_data.isnull().sum() / len(train_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)

Output :

Out[1]: 
                                Missing Ratio
 left_eyebrow_outer_end_x       68.435239
 left_eyebrow_outer_end_y       68.435239
 right_eyebrow_outer_end_y      68.279189
 right_eyebrow_outer_end_x      68.279189
 left_eye_outer_corner_x        67.839410
 left_eye_outer_corner_y        67.839410
 right_eye_inner_corner_x       67.825223
 right_eye_inner_corner_y       67.825223
 right_eye_outer_corner_x       67.825223
 right_eye_outer_corner_y       67.825223
 mouth_left_corner_y            67.811037
 mouth_left_corner_x            67.811037
 left_eyebrow_inner_end_x       67.796851
 left_eyebrow_inner_end_y       67.796851
 right_eyebrow_inner_end_y      67.796851
 mouth_right_corner_x           67.796851
 mouth_right_corner_y           67.796851
 right_eyebrow_inner_end_x      67.796851
 left_eye_inner_corner_x        67.782664
 left_eye_inner_corner_y        67.782664

score 0 · Answer 10 · edited Jul 17 '20 at 17:35

For me I did it like that :

def missing_percent(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_table = pd.concat([mis_val, mis_percent], axis=1)
        
        # Rename the columns
        mis_columns = mis_table.rename(
        columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_columns = mis_columns[
            mis_columns.iloc[:,1] != 0].sort_values(
        'Percent of Total Values', ascending=False).round(2)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_columns

Aravind Raju · Answer 11 · 2020-10-14T07:19:15.673

Let's break down your ask

you want the percentage of missing value
it should be sorted in ascending order and the values to be rounded to 2 floating point

Explanation:

dhr[fill_cols].isnull().sum() - gives the total number of missing values column wise
dhr.shape[0] - gives the total number of rows
(dhr[fill_cols].isnull().sum()/dhr.shape[0]) - gives you a series with percentage as values and column names as index
since the output is a series you can round and sort based on the values

code:

(dhr[fill_cols].isnull().sum()/dhr.shape[0]).round(2).sort_values()

Reference: sort, round

score 0 · Answer 12 · edited Jan 09 '21 at 19:03

0

import numpy as np

import pandas as pd

df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')

df.loc[np.isnan(df['Product_Base_Margin']),['Product_Base_Margin']]=df['Product_Base_Margin'].mean()

print(round(100*(df.isnull().sum()/len(df.index)), 2))

edited Jan 09 '21 at 19:03

flaxel

4,173
4
17
30

answered Jan 09 '21 at 16:45

Malaya Parida

1
1

score 0 · Answer 13 · answered May 11 '21 at 16:03

0

Try this solution


import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
print(round(100*(df.isnull().sum()/len(df.index)),2))

answered May 11 '21 at 16:03

Sourabh Kulkarni

37
8

score 0 · Answer 14 · answered Oct 29 '21 at 13:21

The best solution I have found - (Only shows the missing columns)

missing_values = [feature for feature in df.columns if df[feature].isnull().sum() > 1]

for feature in missing_values:
  print(f"{feature} {np.round(df[feature].isnull().mean(), 4)}% missing values")

Deyvidas B · Answer 15 · 2022-10-08T16:01:19.907

import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df.isna().sum()

Output:

Ord_id                   0
Prod_id                  0
Ship_id                  0
Cust_id                  0
Sales                   20
Discount                55
Order_Quantity          55
Profit                  55
Shipping_Cost           55
Product_Base_Margin    109
dtype: int64

df.shape

Output: (8399, 10)

# for share [0; 1] of nan in each column

df.isna().sum() / df.shape[0]

Output:

Ord_id                0.0000
Prod_id               0.0000
Ship_id               0.0000
Cust_id               0.0000
Sales                 0.0024  # (20  / 8399)
Discount              0.0065  # (55  / 8399)
Order_Quantity        0.0065  # (55  / 8399)
Profit                0.0065  # (55  / 8399)
Shipping_Cost         0.0065  # (55  / 8399)
Product_Base_Margin   0.0130  # (109 / 8399)
dtype: float64

# for percent [0; 100] of nan in each column

df.isna().sum() / (df.shape[0] / 100)

Output:

Ord_id                0.0000
Prod_id               0.0000
Ship_id               0.0000
Cust_id               0.0000
Sales                 0.2381  # (20  / (8399 / 100))
Discount              0.6548  # (55  / (8399 / 100))
Order_Quantity        0.6548  # (55  / (8399 / 100))
Profit                0.6548  # (55  / (8399 / 100))
Shipping_Cost         0.6548  # (55  / (8399 / 100))
Product_Base_Margin   1.2978  # (109 / (8399 / 100))
dtype: float64

# for share [0; 1] of nan in dataframe

df.isna().sum() / (df.shape[0] * df.shape[1])

Output:

Ord_id                0.0000
Prod_id               0.0000
Ship_id               0.0000
Cust_id               0.0000
Sales                 0.0002  # (20  / (8399 * 10))
Discount              0.0007  # (55  / (8399 * 10))
Order_Quantity        0.0007  # (55  / (8399 * 10))
Profit                0.0007  # (55  / (8399 * 10))
Shipping_Cost         0.0007  # (55  / (8399 * 10))
Product_Base_Margin   0.0013  # (109 / (8399 * 10))
dtype: float64

# for percent [0; 100] of nan in dataframe

df.isna().sum() / ((df.shape[0] * df.shape[1]) / 100)

Output:

Ord_id                0.0000
Prod_id               0.0000
Ship_id               0.0000
Cust_id               0.0000
Sales                 0.0238  # (20  / ((8399 * 10) / 100))
Discount              0.0655  # (55  / ((8399 * 10) / 100))
Order_Quantity        0.0655  # (55  / ((8399 * 10) / 100))
Profit                0.0655  # (55  / ((8399 * 10) / 100))
Shipping_Cost         0.0655  # (55  / ((8399 * 10) / 100))
Product_Base_Margin   0.1298  # (109 / ((8399 * 10) / 100))
dtype: float64

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). — Community, Oct 13 '22 at 10:36

Find out the percentage of missing values in each column in the given dataset

15 Answers15

One-liner

Linked