How to TDD with pandas and pytest?

Question

I have a Python script that consolidates reports using pandas throughout, as a sequence of DataFrame operations (drop, groupby, sum, etc.). Let's say I start with a simple function that removes all columns that have no values. It takes a DataFrame as input and returns a DataFrame:

# cei.py
import pandas as pd


def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
    # Example implementation: drop every column whose values are all missing.
    return source_df.dropna(axis="columns", how="all")

I want to verify in my tests that this function actually removes every column whose values are all empty. So I arranged a test input and an expected output, and compared the result with the assert_frame_equal function from pandas.testing:

# test_cei.py
import pandas as pd

import cei  # module under test


def test_clean_table_cols() -> None:
    df = pd.DataFrame(
        {
            "full_valued": [1, 2, 3],
            "all_missing1": [None, None, None],
            "some_missing": [None, 2, 3],
            "all_missing2": [None, None, None],
        }
    )
    expected = pd.DataFrame({"full_valued": [1, 2, 3], "some_missing": [None, 2, 3]})
    result = cei.clean_table_cols(df)
    pd.testing.assert_frame_equal(result, expected)

My question is whether this is conceptually a unit test or an e2e/integration test, since I am not mocking the pandas implementation. But if I mock DataFrame, I won't be testing the functionality of the code. What is the recommended way to test this following TDD best practices?

Note: using pandas in this project is a design decision, so there is deliberately no intention to abstract the pandas interfaces in order to replace it with another library in the future.

Mocking is not required in unit tests. You can safely assume `pandas` works correctly. — hoefling, Apr 18 '20 at 18:31
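As an illustration of the comment above, here is a minimal sketch of testing the function directly against small, real DataFrames, with no mocks. It assumes the one-line `dropna` implementation suggested in the question; the two test names are hypothetical and only show how each TDD step can pin down one behaviour.

```python
import pandas as pd


def clean_table_cols(source_df: pd.DataFrame) -> pd.DataFrame:
    # Assumed implementation, taken from the example in the question.
    return source_df.dropna(axis="columns", how="all")


def test_keeps_partially_filled_columns() -> None:
    # "a" has one real value, "b" has none, so only "a" should survive.
    df = pd.DataFrame({"a": [None, 2.0], "b": [None, None]})
    result = clean_table_cols(df)
    assert list(result.columns) == ["a"]


def test_does_not_mutate_input() -> None:
    # dropna returns a new frame by default; the caller's frame is untouched.
    df = pd.DataFrame({"a": [1, 2], "b": [None, None]})
    clean_table_cols(df)
    assert list(df.columns) == ["a", "b"]
```

Each test builds the smallest frame that exercises one rule, so a failure points at the function's logic rather than at pandas itself.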

score 1 · Answer 1 · answered Jul 29 '20 at 14:51

You might find the tdda (Test-Driven Data Analysis) package useful. Quoting from the docs:

The tdda package provides Python support for test-driven data analysis (see 1-page summary with references, or the blog). The tdda.referencetest library is used to support the creation of reference tests, based on either unittest or pytest. The tdda.constraints library is used to discover constraints from a (Pandas) DataFrame, write them out as JSON, and to verify that datasets meet the constraints in the constraints file. It also supports tables in a variety of relational databases. There is also a command-line utility for discovering and verifying constraints, and detecting failing records. The tdda.rexpy library is a tool for automatically inferring regular expressions from a column in a Pandas DataFrame or from a (Python) list of examples. There is also a command-line utility for Rexpy. Although the library is provided as a Python package, and can be called through its Python API, it also provides command-line tools.

Also see Nick Radcliffe's PyData talk on Test-Driven Data Analysis.

score 0 · Accepted Answer · answered Jun 08 '20 at 12:56

Yes, this code is effectively an integration test, which may not be a bad thing.

Even if using pandas is a fixed design decision, there are still many good reasons to abstract from external libraries. Testing is one of them. Abstracting from external libraries allows the business logic to be tested independently of those libraries. In this case, abstracting from pandas would make the above a unit test: it would test the interactions with the library.

To apply this, I recommend taking a look at the ports and adapters architecture pattern.
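A rough sketch of what that could look like here, under stated assumptions: `ReportTable`, `consolidated_total`, and `InMemoryTable` are hypothetical names invented for illustration, not part of pandas or of the question's code. The port lists the only table operations the business logic needs, and the unit test plugs in a plain in-memory adapter instead of a DataFrame.

```python
from typing import Dict, List, Optional, Protocol


class ReportTable(Protocol):
    """Port (hypothetical): the table operations the business logic relies on."""

    def drop_empty_columns(self) -> "ReportTable": ...

    def column_total(self, column: str) -> float: ...


def consolidated_total(table: ReportTable, column: str) -> float:
    """Business logic: clean the table, then total one column."""
    return table.drop_empty_columns().column_total(column)


class InMemoryTable:
    """Test adapter: a dict of columns standing in for a DataFrame."""

    def __init__(self, columns: Dict[str, List[Optional[float]]]) -> None:
        self.columns = columns

    def drop_empty_columns(self) -> "InMemoryTable":
        # Keep only columns that contain at least one real value.
        kept = {
            name: values
            for name, values in self.columns.items()
            if any(value is not None for value in values)
        }
        return InMemoryTable(kept)

    def column_total(self, column: str) -> float:
        # Missing values are skipped when summing.
        return sum(value for value in self.columns[column] if value is not None)


def test_consolidated_total_skips_missing_values() -> None:
    table = InMemoryTable({"amount": [None, 2.0, 3.0], "unused": [None, None]})
    assert consolidated_total(table, "amount") == 5.0
```

In production, a second adapter wrapping a DataFrame would implement the same two methods (for example by delegating to `dropna` and `sum`), and that adapter is the piece an integration test would cover.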

However, it does indeed mean that you're no longer testing the functionality provided by pandas. If this is still your specific intent, an integration test is not a bad solution.
