MCAR Little's test in Python

Question

How can I run Little's test for MCAR (missing completely at random) in Python? I have looked at the R package for this test, but I want to do it in Python. Is there an alternative approach to testing for MCAR?

What about `impyute` library? Little’s MCAR Test (WIP) is in its feature list. — Istrel, Sep 28 '19 at 10:35
@Istrel The impyute library does not explain how to do it (as far as I have seen). Can you elaborate on the steps or give a link to the proper documentation? — Kiran, Oct 13 '19 at 09:33
The impyute library has a ticket to implement Little's MCAR Test, but it's not in progress: https://github.com/eltonlaw/impyute/issues/71 — skeller88, Feb 26 '20 at 03:16

Akis Hadjimpalasis · Answer 1 · 2022-05-06T15:23:07.390

You can use rpy2 to get the mcar test from R. Note that using rpy2 requires some R coding.

Set up rpy2 in Google Colab

# rpy2 libraries
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import globalenv

# Import R's base package
base = importr("base")

# Import R's utility packages
utils = importr("utils")

# Select mirror 
utils.chooseCRANmirror(ind=1)

# For automatic translation of Pandas objects to R
pandas2ri.activate()

# Enable R magic
%load_ext rpy2.ipython

# Make your Pandas dataframe accessible to R
globalenv["r_df"] = df

You can now get R functionality within your Python environment by using R magics. Use %R for a single line of R code and %%R when the whole cell should be interpreted as R code.

To install an R package use: utils.install_packages("package_name")

You may also need to load it before it can be used: %R library(package_name)

To run Little's MCAR test, install the naniar package. Installing it takes one extra step, because remotes is needed to download it from GitHub; for other packages the general procedure above should be enough.

utils.install_packages("remotes")
%R remotes::install_github("njtierney/naniar")

Load naniar package:

%R library(naniar)

Pass your r_df to the mcar_test function:

# mcar_test on whole df
%R mcar_test(r_df)

If an error occurs, try including only the columns with missing data:

%%R
# mcar_test on columns with missing data
r_dfMissing <- r_df[c("col1", "col2", "col3")]
mcar_test(r_dfMissing)
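Rather than typing the column names by hand, the columns that contain missing values can be selected on the pandas side first. This is a minimal sketch; the example data and the names `missing_cols` and `df_missing` are hypothetical stand-ins for your own DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; replace with your own DataFrame.
df = pd.DataFrame(
    {
        "col1": [1.0, np.nan, 3.0],
        "col2": [4.0, 5.0, 6.0],
        "col3": [np.nan, 8.0, 9.0],
    }
)

# Keep only the columns that have at least one missing value.
missing_cols = df.columns[df.isnull().any()].tolist()
df_missing = df[missing_cols]

print(missing_cols)  # ['col1', 'col3']
```

The resulting `df_missing` can then be handed to R in the same way as `df` above, for example with `globalenv["r_dfMissing"] = df_missing`.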

Nice. Can you put a few words on why you would include only variables with missing data? I thought the idea was to assess differences in variables grouped by missing/non-missing, which I cannot imagine will work if we drop cols without missing. — Johan, Jun 18 '23 at 13:15
That's a good question. The only reason I suggested including just the variables with missing data is that the mcar_test() function raised an error otherwise. I am not sure if this happens in every situation or just with the data I tried it with. — Akis Hadjimpalasis, Aug 23 '23 at 07:18
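The grouping idea raised in the comment above (comparing variables between rows where another variable is missing and rows where it is observed) can also be checked directly in Python, as a rough alternative to Little's test. The sketch below runs a Welch t-test for each such pair using `scipy.stats.ttest_ind`. The function name `missingness_ttests` and the `min_group` threshold are hypothetical, and this is a different diagnostic from Little's test: it produces one p-value per pair, so the results would need a multiple-comparison correction before drawing conclusions.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind


def missingness_ttests(df: pd.DataFrame, min_group: int = 5) -> pd.DataFrame:
    """Welch t-tests of each numeric column, split by missingness of another column."""
    numeric = df.select_dtypes("number")
    results = []
    # Loop over columns that have at least one missing value.
    for indicator in numeric.columns[numeric.isnull().any()]:
        is_missing = numeric[indicator].isnull()
        for target in numeric.columns.drop(indicator):
            group_missing = numeric.loc[is_missing, target].dropna()
            group_observed = numeric.loc[~is_missing, target].dropna()
            # Skip comparisons where either group is too small to be meaningful.
            if len(group_missing) < min_group or len(group_observed) < min_group:
                continue
            t_stat, p_value = ttest_ind(group_missing, group_observed, equal_var=False)
            results.append(
                {"missing_in": indicator, "compared": target, "t": t_stat, "p_value": p_value}
            )
    return pd.DataFrame(results, columns=["missing_in", "compared", "t", "p_value"])
```

Small p-values suggest that the missingness in `missing_in` is related to the values of `compared`, which would be evidence against MCAR.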

score 2 · Answer 2 · answered May 14 '23 at 11:55

You can use this function to run Little's MCAR test instead of using R code:

import numpy as np
import pandas as pd
from scipy.stats import chi2

def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    
    Parameters:
    data (DataFrame): A pandas DataFrame with n observations and p variables, where some values are missing.
    alpha (float): The significance level for the hypothesis test (default is 0.05).
    
    Returns:
    A tuple containing:
    - A matrix of missing values that represents the pattern of missingness in the dataset.
    - A p-value representing the significance of the MCAR test.
    """
    
    # Calculate the proportion of missing values in each variable
    p_m = data.isnull().mean()
    
    # Calculate the proportion of complete cases for each variable
    p_c = data.dropna().shape[0] / data.shape[0]
    
    # Calculate the correlation matrix for all pairs of variables that have complete cases
    R_c = data.dropna().corr()
    
    # Calculate the correlation matrix for all pairs of variables using all observations
    R_all = data.corr()
    
    # Calculate the difference between the two correlation matrices
    R_diff = R_all - R_c
    
    # Calculate the variance of the R_diff matrix
    V_Rdiff = np.var(R_diff, ddof=1)
    
    # Calculate the expected value of V_Rdiff under the null hypothesis that the missing data is MCAR
    E_Rdiff = (1 - p_c) / (1 - p_m).sum()
    
    # Calculate the test statistic
    T = np.trace(R_diff) / np.sqrt(V_Rdiff * E_Rdiff)
    
    # Calculate the degrees of freedom
    df = data.shape[1] * (data.shape[1] - 1) / 2
    
    # Calculate the p-value using a chi-squared distribution with df degrees of freedom and the test statistic T
    p_value = 1 - chi2.cdf(T ** 2, df)
    
    # Create a matrix of missing values that represents the pattern of missingness in the dataset
    missingness_matrix = data.isnull().astype(int)
    
    # Return the missingness matrix and the p-value
    return missingness_matrix, p_value

Cool. What df do you expect as input? And I thought Little's test should return one test with one p-value, not one per column. — Johan, Jun 18 '23 at 13:25
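As the comment above points out, Little's test yields a single statistic and a single p-value. The usual formulation groups rows by missingness pattern and sums, over patterns j, the term n_j (ybar_j - mu_j)' Sigma_j^{-1} (ybar_j - mu_j), where ybar_j is the mean of the observed variables in pattern j and mu_j, Sigma_j are the matching parts of the overall mean and covariance estimates. The result is compared with a chi-squared distribution with (sum over patterns of the number of observed variables) minus (number of variables) degrees of freedom. The following is a minimal sketch of that calculation, not a reference implementation. It assumes numeric columns, and it substitutes available-case means and pairwise-complete covariances for the maximum-likelihood (EM) estimates the original test calls for, so its results may differ from R's `mcar_test`. The function name `little_mcar_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def little_mcar_sketch(data: pd.DataFrame):
    """Return (statistic, dof, p_value) for an approximate Little's MCAR test."""
    values = data.select_dtypes("number").to_numpy(dtype=float)
    n_vars = values.shape[1]

    # Assumption: available-case means and pairwise-complete covariances
    # stand in for the EM (maximum-likelihood) estimates.
    mu = np.nanmean(values, axis=0)
    sigma = pd.DataFrame(values).cov().to_numpy()

    # Find the distinct missingness patterns and map each row to its pattern.
    mask = np.isnan(values)
    patterns, row_pattern = np.unique(mask, axis=0, return_inverse=True)
    row_pattern = row_pattern.ravel()

    statistic = 0.0
    observed_total = 0
    for j, pattern in enumerate(patterns):
        observed = ~pattern
        if not observed.any():
            continue  # rows with every value missing carry no information
        rows = values[row_pattern == j][:, observed]
        diff = rows.mean(axis=0) - mu[observed]
        cov = sigma[np.ix_(observed, observed)]
        # Pseudo-inverse guards against a singular pairwise covariance block.
        statistic += rows.shape[0] * diff @ np.linalg.pinv(cov) @ diff
        observed_total += int(observed.sum())

    dof = observed_total - n_vars
    p_value = chi2.sf(statistic, dof)
    return statistic, dof, p_value
```

A small p-value is evidence against MCAR; a large one only means the test found no evidence against it. If the data have a single missingness pattern, the degrees of freedom are zero and the p-value is not meaningful.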

score 0 · Answer 3 · answered Jun 18 '23 at 08:41

Comments suggest using existing packages. Here is an example directly taken from pyampute:

import pandas as pd
from pyampute.exploration.mcar_statistical_tests import MCARTest
data_mcar = pd.read_table("data/missingdata_mcar.csv")
mt = MCARTest(method="little")
print(mt.little_mcar_test(data_mcar))
# Output: 0.17365464213775494

Tamunoala · Answer 4 · 2023-06-29T13:51:42.480

import numpy as np
import pandas as pd
from scipy.stats import chi2

def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    """
    data = pd.DataFrame(data).copy()  # work on a copy so the caller's DataFrame is not modified
    data.columns = ['x' + str(i) for i in range(data.shape[1])]
    data['missing'] = np.sum(data.isnull(), axis=1)
    n = data.shape[0]
    k = data.shape[1] - 1
    df = k * (k - 1) / 2
    chi2_crit = chi2.ppf(1 - alpha, df)
    chi2_val = ((n - 1 - (k - 1) / 2) ** 2) / (k - 1) / ((n - k) * np.mean(data['missing']))
    p_val = 1 - chi2.cdf(chi2_val, df)
    if chi2_val > chi2_crit:
        print(
            'Reject null hypothesis: Data is not MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )
    else:
        print(
            'Do not reject null hypothesis: Data is MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )

As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). — Community, Jul 01 '23 at 06:10
