1

In my df below, I want to :

  1. identify and flag the outliers in col_E using z-scores
  2. separately explain how to identify and flag the outliers using z-scores in two or more columns, for example col_D & col_E

See below for the dataset

import pandas as pd
from scipy import stats
  
# intialise data of lists
df = { 
         'col_A':['P0', 'P1', 'P2', 'P4', 'P5'],
         'col_B':[1,1,1,1,1],
         'col_C':[1,2,3,5,9],
         'col_D':[120.05, 181.90, 10.34, 153.10, 311.17],
        'col_E':[110.21, 191.12, 190.21, 12.00, 245.09 ],
        'col_F':[100.22,199.10, 191.13,199.99, 255.19],
        'col_G':[140.29, 291.07, 390.22, 245.09, 4122.62],
        
        }
  
# Create DataFrame
df = pd.DataFrame(df)
  
# Print the output.
df

Desired: flag all outliers in col_D first and then col_D and col_E secondly (Note: In my image below 10.34 and 12.00 were randomly highlighted)

Q1

enter image description here

Attempt:

#Q1
exclude_cols = ['col_A','col_B','col_C','col_D','col_F','col_G']
include_cols = ['col_E'] # desired column

def flag_outliers(s, exclude_cols):
    if s.name in exclude_cols:
        print(s.name)
        return '' 
    else:
        s=df[(np.abs(stats.zscore(df['col_E'])) > 3)] # not sure of this part of the code
        
        return ['background-color: yellow' if v else '' for v in indexes]

df.style.apply(lambda s: flag_outliers(s, exclude_cols), axis=1, subset=include_cols)



#Q2
exclude_cols = ['col_A','col_B','col_C','col_F','col_G']
include_cols = ['col_D','col_E'] # desired columns

def flag_outliers(s, exclude_cols):
    if s.name in exclude_cols:
        print(s.name)
        return '' 
    else:
        s=df[(np.abs(stats.zscore(df['col_E'])) > 3)] # not sure of this part of the code
        
        return ['background-color: yellow' if v else '' for v in indexes]

df.style.apply(lambda s: flag_outliers(s, exclude_cols), axis=1, subset=include_cols)

Thanks!

Miguel Trejo
  • 5,913
  • 5
  • 24
  • 49
nasa313
  • 266
  • 2
  • 12

2 Answers2

2

I assume the following meanings to demonstrate a broader range of usage.

  • Q1 stands for calculating a single column
  • Q2 stands for calculating over multiple columns pooled together.

If Q2 is meant to calculated on multiple columns separately, then you can simply loop your Q1 solution over multiple columns, which should be trivial so I will omit such situation here.

Keys

  • Q1 is quite straightforward as one can return a list of values by list comprehension.
  • Q2 is a little bit complicated because the z-score would be applied over a DataFrame subset (i.e. axis=None must be used). According to the official docs, when applying style over a DataFrame, the returning object must also be a DataFrame with the same index and columns as the subset. This is what caused the reshaping and DataFrame construction artifacts.

Single Column (Q1)

Note that z=3 is lowered to 1.5 for demonstration purpose. # desired column include_cols = ['col_E']

# additional control
outlier_threshold = 1.5   # 3 won't work!
ddof = 0   # degree of freedom correction. Sample = 1 and population = 0.

def flag_outliers(s: pd.Series):
    outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
    # replace boolean values with corresponding strings
    return ['background-color: yellow' if val else '' for val in outlier_mask]

df.style.apply(flag_outliers, subset=include_cols)

Result

enter image description here

Multiple Column Pooled (Q2, Assumed)

Q2

include_cols = ['col_D', 'col_E']  # desired columns

outlier_threshold = 1.5
ddof = 0

def flag_outliers(s: pd.DataFrame) -> pd.DataFrame:
    outlier_mask = np.abs(stats.zscore(s.values.reshape(-1), axis=None, ddof=ddof)) > outlier_threshold
    # prepare the array of string to be returned
    arr = np.array(['background-color: yellow' if val else '' for val in outlier_mask], dtype=object).reshape(s.shape)
    # cast the array into dataframe
    return pd.DataFrame(arr, columns=s.columns, index=s.index)

df.style.apply(flag_outliers, axis=None, subset=include_cols)

Result

enter image description here

Bill Huang
  • 4,491
  • 2
  • 13
  • 31
0

Based on this answer, just pass the condition of the score to a dict storing the background color of each column index.

include_cols = ['col_D', 'col_E']

def color_outliers_yellow(row, include, color='yellow', z_score = 1):
    
    styles = {col: '' for col in row.index}
    
    if row.name in include:
        
        scores = stats.zscore(list(row))
        
        scores = [(f'background-color: {color}' if score > z_score else '') for score in scores]
        
        return {k:v for k, v in zip(styles.keys(), scores)}
    
    else:
        
        return styles

df.style.apply(lambda x: color_outliers_yellow(x, include=include_cols), axis=0)

Results in:

enter image description here

Miguel Trejo
  • 5,913
  • 5
  • 24
  • 49
  • I would love for someone to break down what this code is doing/how it works... For my situation I want to replace the 'outliers' rather than highlight them. – flashliquid Oct 12 '21 at 01:56
  • @flashliquid do you mean replacing values? could you provide more context? It seems that styling approach would not be ideal for your case – Miguel Trejo Oct 12 '21 at 16:57