0

Ever found yourself wondering what rows are in Pandas DataFrame A only, but not in B (or vice versa), and what rows are in A, as well as in B (or vice versa)?

While there exists pandas.DataFrame.diff() (more like an element-by-element subtraction), pandas.Index.intersection (only work on the index) and pandas.DataFrame.merge() (more like SQL join to merge two DataFrames), none of them does exactly what we need here...

Frank Wang
  • 381
  • 3
  • 7

1 Answers1

1

Therefore I wrote these two functions df_diff() and df_overlap() to do exactly that -- see the visual demo and the code below.

Great things about them:

  • Can compare on any column (not only on the index - default to compare on index if omit arguments on_A and/or on_B)
  • Can compare on different column combination (like column_this in DataFrame A and column_that in DataFrame B. You can also insert concatenated id columns on your own and then compare on them)
  • Can handle duplicated/missing ids correctly
  • They are mainly based on set operations -- super fast
  • Has error check/handling (will safely return empty DataFrame if found no result or got invalid input)

To use the function(s), simply copy them from below, or visit my GitHub repo for a full use case!

enter image description here

import pandas as pd

def df_diff(df_A: pd.DataFrame, df_B: pd.DataFrame, on_A: str = "", on_B: str = "") -> pd.DataFrame:
    """
    Function: Compare DataFrame "A" and "B" to find rows only in "A" but not in "B"
    Input:
        df_A: DataFrame "A" ("left table")
        df_B: DataFrame "B" ("right table")
        on_A: column name in DataFrame "A" to compare on. If not provided/valid, will default to using df_A's index
        on_B: column name in DataFrame "B" to compare on. If not provided/valid, will default to using df_B's index
    
    Output:
        DataFrame containing diff result (all rows only in df_A but not in df_B, and same columns as df_A)
        If find zero rows, will return a DataFrame of 0 row and same columns as df_A (can be checked by `df_output.empty and df_output.shape[1] != 0`)
        If input is not valid DataFrame, will return a DataFrame of 0 row and 0 column (can be checked by `df_output.empty and df_output.shape[1] == 0`)
    
    Dependency: `import pandas as pd`
    History: 2022-02-07 Developed by frank-yifei-wang@GitHub
    """
    
    if type(df_A) != pd.core.frame.DataFrame or type(df_B) != pd.core.frame.DataFrame: 
        return pd.DataFrame()
    
    if on_A != "" and on_A in df_A.columns: 
        id_col_A = df_A[on_A]
    else:
        id_col_A = df_A.index
    if on_B != "" and on_B in df_B.columns: 
        id_col_B = df_B[on_B]
    else:
        id_col_B = df_B.index
    
    id_set_A = set(id_col_A)
    id_set_B = set(id_col_B)

    id_set_diff = id_set_A.difference(id_set_B)
    df_output = df_A[id_col_A.isin(id_set_diff)].copy()

    return df_output

def df_overlap(df_A: pd.DataFrame, df_B: pd.DataFrame, on_A: str = "", on_B: str = "") -> pd.DataFrame:
    """
    Function: Compare DataFrame "A" and "B" to find rows in "A" and also in "B"
    Input:
        df_A: DataFrame "A" ("left table")
        df_B: DataFrame "B" ("right table")
        on_A: column name in DataFrame "A" to compare on. If not provided/valid, will default to using df_A's index
        on_B: column name in DataFrame "B" to compare on. If not provided/valid, will default to using df_B's index
    
    Output:
        DataFrame containing overlap result (all rows in df_A and also in df_B, and same columns as df_A)
        Note: result of df_overlap(df_A, df_B) (= a slice of df_A) is different from df_overlap(df_B, df_A) (= a slice of df_B)
        If find zero rows, will return a DataFrame of 0 row and same columns as df_A (can be checked by `df_output.empty and df_output.shape[1] != 0`)
        If input is not valid DataFrame, will return a DataFrame of 0 row and 0 column (can be checked by `df_output.empty and df_output.shape[1] == 0`)
    
    Dependency: `import pandas as pd`
    History: 2022-02-07 Developed by frank-yifei-wang@GitHub
    """
    
    if type(df_A) != pd.core.frame.DataFrame or type(df_B) != pd.core.frame.DataFrame: 
        return pd.DataFrame()
    
    if on_A != "" and on_A in df_A.columns: 
        id_col_A = df_A[on_A]
    else:
        id_col_A = df_A.index
    if on_B != "" and on_B in df_B.columns: 
        id_col_B = df_B[on_B]
    else:
        id_col_B = df_B.index
    
    id_set_A = set(id_col_A)
    id_set_B = set(id_col_B)

    id_set_overlap = id_set_A.intersection(id_set_B)
    df_output = df_A[id_col_A.isin(id_set_overlap)].copy()

    return df_output
Frank Wang
  • 381
  • 3
  • 7