79
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
percent= 100*(len(df.loc[:,df.isnull().sum(axis=0)>=1 ].index) / len(df.index))
print(round(percent,2))

input is https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0

and the output should be

Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.24
Discount               0.65
Order_Quantity         0.65
Profit                 0.65
Shipping_Cost          0.65
Product_Base_Margin    1.30
dtype: float64
Prune
  • 76,765
  • 14
  • 60
  • 81
Shaswata
  • 1,049
  • 1
  • 9
  • 27

15 Answers15

120

How about this? I think I actually found something similar on here once before, but I'm not seeing it now...

percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
                                 'percent_missing': percent_missing})

And if you want the missing percentages sorted, follow the above with:

missing_value_df.sort_values('percent_missing', inplace=True)

As mentioned in the comments, you may also be able to get by with just the first line in my code above, i.e.:

percent_missing = df.isnull().sum() * 100 / len(df)
Engineero
  • 12,340
  • 5
  • 53
  • 75
  • I don't think you need the first line and the last line. The middle line produces the result he wants – user3483203 Jun 27 '18 at 20:39
  • I don't know, the output he shows looks like a copied and pasted pandas dataframe itself, thus I'm building a dataframe from the existing one's columns and their percent missing. But we'll see what the OP says. – Engineero Jun 27 '18 at 20:40
  • its not printing the dtype float 64 at the last – Shaswata Jun 27 '18 at 20:45
  • @Shaswata yeah, looks like you can get the output from just doing the `df.isnull().sum()...` after all. Added that to my answer as it was pointed out in the comments. – Engineero Jun 27 '18 at 20:49
57

Update let's use mean with isnull:

df.isnull().mean() * 100

Output:

Ord_id                 0.000000
Prod_id                0.000000
Ship_id                0.000000
Cust_id                0.000000
Sales                  0.238124
Discount               0.654840
Order_Quantity         0.654840
Profit                 0.654840
Shipping_Cost          0.654840
Product_Base_Margin    1.297774
dtype: float64

IIUC:

df.isnull().sum() / df.shape[0] * 100.00

Output:

Ord_id                 0.000000
Prod_id                0.000000
Ship_id                0.000000
Cust_id                0.000000
Sales                  0.238124
Discount               0.654840
Order_Quantity         0.654840
Profit                 0.654840
Shipping_Cost          0.654840
Product_Base_Margin    1.297774
dtype: float64
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
13

single line solution

df.isnull().mean().round(4).mul(100).sort_values(ascending=False)
bitbang
  • 1,804
  • 14
  • 18
  • 1
    You can also turn this into a data frame for further manipulation if required (still a one-liner): ```pd.DataFrame(df.isnull().mean().round(4).mul(100).sort_values(ascending=False), columns=['Percentage Missing'])``` – Ricky McMaster Jan 18 '22 at 09:35
7

To cover all missing values and round the results:

((df.isnull() | df.isna()).sum() * 100 / df.index.size).round(2)

The output:

Out[556]: 
Ord_id                 0.00
Prod_id                0.00
Ship_id                0.00
Cust_id                0.00
Sales                  0.24
Discount               0.65
Order_Quantity         0.65
Profit                 0.65
Shipping_Cost          0.65
Product_Base_Margin    1.30
dtype: float64
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105
  • 1
    `isnull` and `isna` are aliases, as far as I can tell – IanS Jun 28 '18 at 14:57
  • @IanS, yes, at the moment. But in case if they are interchangeable, one of them could be somehow extended in future versions or one of them could be removed. So, in one case we have an advanced action, in 2nd case we have a potential indicating point showing which function is actual. – RomanPerekhrest Jun 28 '18 at 16:44
3

The solution you're looking for is :

round(df.isnull().mean()*100,2) 

This will round up the percentage upto 2 decimal places

Another way to do this is

round((df.isnull().sum()*100)/len(df),2)

but this is not efficient as using mean() is.

2
import numpy as np
import pandas as pd

raw_data = {'first_name': ['Jason', np.nan, 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', np.nan, np.nan, 'Milner', 'Cooze'], 
        'age': [22, np.nan, 23, 24, 25], 
        'sex': ['m', np.nan, 'f', 'm', 'f'], 
        'Test1_Score': [4, np.nan, 0, 0, 0],
        'Test2_Score': [25, np.nan, np.nan, 0, 0]}
results = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'sex', 'Test1_Score', 'Test2_Score'])


results 

  first_name last_name   age  sex  Test1_Score  Test2_Score
0      Jason    Miller  22.0    m          4.0         25.0
1        NaN       NaN   NaN  NaN          NaN          NaN
2       Tina       NaN  23.0    f          0.0          NaN
3       Jake    Milner  24.0    m          0.0          0.0
4        Amy     Cooze  25.0    f          0.0          0.0

You can use following function, which will give you output in Dataframe

  • Zero Values
  • Missing Values
  • % of Total Values
  • Total Zero Missing Values
  • % Total Zero Missing Values
  • Data Type

Just copy and paste following function and call it by passing your pandas Dataframe

def missing_zero_values_table(df):
        zero_val = (df == 0.00).astype(int).sum(axis=0)
        mis_val = df.isnull().sum()
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
        mz_table = mz_table.rename(
        columns = {0 : 'Zero Values', 1 : 'Missing Values', 2 : '% of Total Values'})
        mz_table['Total Zero Missing Values'] = mz_table['Zero Values'] + mz_table['Missing Values']
        mz_table['% Total Zero Missing Values'] = 100 * mz_table['Total Zero Missing Values'] / len(df)
        mz_table['Data Type'] = df.dtypes
        mz_table = mz_table[
            mz_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"      
            "There are " + str(mz_table.shape[0]) +
              " columns that have missing values.")
#         mz_table.to_excel('D:/sampledata/missing_and_zero_values.xlsx', freeze_panes=(1,0), index = False)
        return mz_table

missing_zero_values_table(results)

Output

Your selected dataframe has 6 columns and 5 Rows.
There are 6 columns that have missing values.

             Zero Values  Missing Values  % of Total Values  Total Zero Missing Values  % Total Zero Missing Values Data Type
last_name              0               2               40.0                          2                         40.0    object
Test2_Score            2               2               40.0                          4                         80.0   float64
first_name             0               1               20.0                          1                         20.0    object
age                    0               1               20.0                          1                         20.0   float64
sex                    0               1               20.0                          1                         20.0    object
Test1_Score            3               1               20.0                          4                         80.0   float64

If you want to keep it simple then you can use following function to get missing values in %

def missing(dff):
    print (round((dff.isnull().sum() * 100/ len(dff)),2).sort_values(ascending=False))


missing(results)

Test2_Score    40.0
last_name      40.0
Test1_Score    20.0
sex            20.0
age            20.0
first_name     20.0
dtype: float64
Suhas_Pote
  • 3,620
  • 1
  • 23
  • 38
  • In the top larger table is there a way to have a dataframe of all columns so anyone that has no issues is still listed ? – Tinkinc Mar 21 '22 at 19:04
2

One-liner

I'm wondering nobody takes advantage of the size and count? It seems the shortest (and probably fastest) way to do it.

df.apply(lambda x: 1-(x.count()/x.size))

Resulting in:

Ord_id                 0.000000
Prod_id                0.000000
Ship_id                0.000000
Cust_id                0.000000
Sales                  0.002381
Discount               0.006548
Order_Quantity         0.006548
Profit                 0.006548
Shipping_Cost          0.006548
Product_Base_Margin    0.012978
dtype: float64

If you find any reason why this is not a good way, please comment

tturbo
  • 680
  • 5
  • 16
1

If there are multiple dataframe below is the function to calculate number of missing value in each column with percentage

def miss_data(df):
    x = ['column_name','missing_data', 'missing_in_percentage']
    missing_data = pd.DataFrame(columns=x)
    columns = df.columns
    for col in columns:
        icolumn_name = col
        imissing_data = df[col].isnull().sum()
        imissing_in_percentage = (df[col].isnull().sum()/df[col].shape[0])*100

        missing_data.loc[len(missing_data)] = [icolumn_name, imissing_data, imissing_in_percentage]
    print(missing_data) 
GpandaM
  • 47
  • 4
1

By this following code, you can get the corresponding percentage values from every columns. Just switch the name train_data with df, in case of yours.

Input:

In [1]:

all_data_na = (train_data.isnull().sum() / len(train_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio' :all_data_na})
missing_data.head(20)

Output :

Out[1]: 
                                Missing Ratio
 left_eyebrow_outer_end_x       68.435239
 left_eyebrow_outer_end_y       68.435239
 right_eyebrow_outer_end_y      68.279189
 right_eyebrow_outer_end_x      68.279189
 left_eye_outer_corner_x        67.839410
 left_eye_outer_corner_y        67.839410
 right_eye_inner_corner_x       67.825223
 right_eye_inner_corner_y       67.825223
 right_eye_outer_corner_x       67.825223
 right_eye_outer_corner_y       67.825223
 mouth_left_corner_y            67.811037
 mouth_left_corner_x            67.811037
 left_eyebrow_inner_end_x       67.796851
 left_eyebrow_inner_end_y       67.796851
 right_eyebrow_inner_end_y      67.796851
 mouth_right_corner_x           67.796851
 mouth_right_corner_y           67.796851
 right_eyebrow_inner_end_x      67.796851
 left_eye_inner_corner_x        67.782664
 left_eye_inner_corner_y        67.782664
naimur978
  • 144
  • 8
0

For me I did it like that :

def missing_percent(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_table = pd.concat([mis_val, mis_percent], axis=1)
        
        # Rename the columns
        mis_columns = mis_table.rename(
        columns = {0 : 'Missing Values', 1 : 'Percent of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_columns = mis_columns[
            mis_columns.iloc[:,1] != 0].sort_values(
        'Percent of Total Values', ascending=False).round(2)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_columns
Dharman
  • 30,962
  • 25
  • 85
  • 135
Salma Elshahawy
  • 1,112
  • 2
  • 11
  • 21
0

Let's break down your ask

  1. you want the percentage of missing value
  2. it should be sorted in ascending order and the values to be rounded to 2 floating point

Explanation:

  1. dhr[fill_cols].isnull().sum() - gives the total number of missing values column wise
  2. dhr.shape[0] - gives the total number of rows
  3. (dhr[fill_cols].isnull().sum()/dhr.shape[0]) - gives you a series with percentage as values and column names as index
  4. since the output is a series you can round and sort based on the values

code:

(dhr[fill_cols].isnull().sum()/dhr.shape[0]).round(2).sort_values()

Reference: sort, round

0
import numpy as np

import pandas as pd

df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')

df.loc[np.isnan(df['Product_Base_Margin']),['Product_Base_Margin']]=df['Product_Base_Margin'].mean()

print(round(100*(df.isnull().sum()/len(df.index)), 2))
flaxel
  • 4,173
  • 4
  • 17
  • 30
0

Try this solution


import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
print(round(100*(df.isnull().sum()/len(df.index)),2))

0

The best solution I have found - (Only shows the missing columns)

missing_values = [feature for feature in df.columns if df[feature].isnull().sum() > 1]

for feature in missing_values:
  print(f"{feature} {np.round(df[feature].isnull().mean(), 4)}% missing values")
0
import pandas as pd
df = pd.read_csv('https://query.data.world/s/Hfu_PsEuD1Z_yJHmGaxWTxvkz7W_b0')
df.isna().sum()

Output:

Ord_id                   0
Prod_id                  0
Ship_id                  0
Cust_id                  0
Sales                   20
Discount                55
Order_Quantity          55
Profit                  55
Shipping_Cost           55
Product_Base_Margin    109
dtype: int64

df.shape

Output: (8399, 10)

# for share [0; 1] of nan in each column

df.isna().sum() / df.shape[0]

Output:

Ord_id                0.0000
Prod_id               0.0000
Ship_id               0.0000
Cust_id               0.0000
Sales                 0.0024  # (20  / 8399)
Discount              0.0065  # (55  / 8399)
Order_Quantity        0.0065  # (55  / 8399)
Profit                0.0065  # (55  / 8399)
Shipping_Cost         0.0065  # (55  / 8399)
Product_Base_Margin   0.0130  # (109 / 8399)
dtype: float64

# for percent [0; 100] of nan in each column

df.isna().sum() / (df.shape[0] / 100)

Output:

Ord_id                0.0000
Prod_id               0.0000
Ship_id               0.0000
Cust_id               0.0000
Sales                 0.2381  # (20  / (8399 / 100))
Discount              0.6548  # (55  / (8399 / 100))
Order_Quantity        0.6548  # (55  / (8399 / 100))
Profit                0.6548  # (55  / (8399 / 100))
Shipping_Cost         0.6548  # (55  / (8399 / 100))
Product_Base_Margin   1.2978  # (109 / (8399 / 100))
dtype: float64

# for share [0; 1] of nan in dataframe

df.isna().sum() / (df.shape[0] * df.shape[1])

Output:

Ord_id                0.0000
Prod_id               0.0000
Ship_id               0.0000
Cust_id               0.0000
Sales                 0.0002  # (20  / (8399 * 10))
Discount              0.0007  # (55  / (8399 * 10))
Order_Quantity        0.0007  # (55  / (8399 * 10))
Profit                0.0007  # (55  / (8399 * 10))
Shipping_Cost         0.0007  # (55  / (8399 * 10))
Product_Base_Margin   0.0013  # (109 / (8399 * 10))
dtype: float64

# for percent [0; 100] of nan in dataframe

df.isna().sum() / ((df.shape[0] * df.shape[1]) / 100)

Output:

Ord_id                0.0000
Prod_id               0.0000
Ship_id               0.0000
Cust_id               0.0000
Sales                 0.0238  # (20  / ((8399 * 10) / 100))
Discount              0.0655  # (55  / ((8399 * 10) / 100))
Order_Quantity        0.0655  # (55  / ((8399 * 10) / 100))
Profit                0.0655  # (55  / ((8399 * 10) / 100))
Shipping_Cost         0.0655  # (55  / ((8399 * 10) / 100))
Product_Base_Margin   0.1298  # (109 / ((8399 * 10) / 100))
dtype: float64
  • 1
    Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Oct 13 '22 at 10:36