
For example, I'd like to assert that two PySpark DataFrames have the same data; however, just using == checks that they are the same object. Ideally I'd also like to be able to specify whether order matters or not.

I've tried writing a function that raises an AssertionError, but that adds a lot of noise to the pytest output because it shows the traceback from that function.

The other thought I had was to mock the __eq__ method of the DataFrames but I'm not confident that's the right way to go.

Edit:

I considered just using a function that returns True or False instead of an operator; however, that doesn't seem to work with pytest_assertrepr_compare. I'm not familiar enough with how that hook works, so it's possible there is a way to use it with a function instead of an operator.
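For reference, one way to cut the traceback noise from a helper that raises AssertionError is pytest's `__tracebackhide__` flag, which hides the helper's frame from the failure report. A minimal sketch (the helper name and order-handling are made up for illustration, not an established API):

```python
import pandas as pd


def assert_df_equal(left, right, check_order=True):
    # pytest omits this frame from failure tracebacks when
    # __tracebackhide__ is set to True.
    __tracebackhide__ = True
    if not check_order:
        # Hypothetical order-insensitive mode: sort both frames first
        left = left.sort_values(list(left.columns)).reset_index(drop=True)
        right = right.sort_values(list(right.columns)).reset_index(drop=True)
    if not left.equals(right):
        raise AssertionError(f"DataFrames differ:\n{left}\n!=\n{right}")
```

This doesn't integrate with pytest_assertrepr_compare, though; it only keeps the report readable.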

kfoley
  • It seems it is not as simple as it sounds. [link](https://stackoverflow.com/questions/31197353/dataframe-equality-in-apache-spark) – IMCoins Feb 09 '19 at 13:42
  • I'm fine with comparing the two DataFrames, my question is how do I use that comparison logic within a pytest assertion – kfoley Feb 09 '19 at 13:47
  • If you want to do that specific test, you could make a function that returns True or False for the comparison you wish to make, and then make a simple assertTrue assertion? Something like that. I'm aware of unittest and haven't tried pytest, but it must be the same logic, right? – IMCoins Feb 09 '19 at 14:00

4 Answers


My current solution is to use a patch to override the DataFrame's __eq__ method. Here's an example with pandas, as it's faster to test with; the idea should apply to any object.

import pandas as pd
from unittest.mock import patch  # on Python 2, use: from mock import patch


def custom_df_compare(self, other):
    # Put logic for comparing df's here
    # Returning True for demonstration
    return True


@patch("pandas.DataFrame.__eq__", custom_df_compare)
def test_df_equal():
    df1 = pd.DataFrame(
        {"id": [1, 2, 3], "name": ["a", "b", "c"]}, columns=["id", "name"]
    )
    df2 = pd.DataFrame(
        {"id": [2, 3, 4], "name": ["b", "c", "d"]}, columns=["id", "name"]
    )

    assert df1 == df2

I haven't tried it yet, but I'm planning on adding it as an autouse fixture so it's applied to all tests automatically.

To elegantly handle the "order matters" indicator, I'm playing with an approach similar to pytest.approx, which returns a new class with its own __eq__. For example:

class SortedDF(object):
    "Indicates that the order of data matters when comparing to another df"

    def __init__(self, df):
        self.df = df

    def __eq__(self, other):
        # Put logic for comparing df's including order of data here
        # Returning True for demonstration purposes
        return True


def test_sorted_df():
    df1 = pd.DataFrame(
        {"id": [1, 2, 3], "name": ["a", "b", "c"]}, columns=["id", "name"]
    )
    df2 = pd.DataFrame(
        {"id": [2, 3, 4], "name": ["b", "c", "d"]}, columns=["id", "name"]
    )

    # Passes because SortedDF.__eq__ is used
    assert SortedDF(df1) == df2
    # Fails because df2's __eq__ method is used
    assert df2 == SortedDF(df2)

The minor issue I haven't been able to resolve is the failure of the second assert, assert df2 == SortedDF(df2). This order works fine with pytest.approx but doesn't here. I've tried reading up on the == operator but haven't been able to figure out how to fix the second case.
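As far as I can tell, the behavior follows Python's binary operator protocol: the right operand's __eq__ only gets the first look when its type is a subclass of the left operand's type; otherwise the left operand's __eq__ runs first and wins unless it returns NotImplemented (which pandas' __eq__ doesn't, since it answers the comparison itself). A minimal demonstration with plain classes:

```python
class Plain:
    def __eq__(self, other):
        # Analogue of pandas' DataFrame.__eq__: answers the comparison
        # itself instead of returning NotImplemented
        return False


class Wrapper:
    def __eq__(self, other):
        return True


class PlainSubclass(Plain):
    def __eq__(self, other):
        return True


print(Wrapper() == Plain())        # True  -- Wrapper.__eq__ runs first
print(Plain() == Wrapper())        # False -- Plain.__eq__ answers; Wrapper never consulted
print(Plain() == PlainSubclass())  # True  -- subclass's reflected __eq__ gets priority
```

That suggests the wrapper would work on either side only if the wrapped type's __eq__ deferred to it, which DataFrame's doesn't.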

kfoley

To do a raw comparison between the values of the DataFrames (must be exact order), you can do something like this:

import pandas as pd
from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
df2 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])

pd.testing.assert_frame_equal(df1.toPandas(), df2.toPandas())

If row order shouldn't matter, you can sort both pandas DataFrames by key columns first using the following function:

def assert_frame_equal_with_sort(results, expected, keycolumns):
    results = results.reindex(sorted(results.columns), axis=1)
    expected = expected.reindex(sorted(expected.columns), axis=1)

    results_sorted = results.sort_values(by=keycolumns).reset_index(drop=True)
    expected_sorted = expected.sort_values(by=keycolumns).reset_index(drop=True)

    pd.testing.assert_frame_equal(results_sorted, expected_sorted)


df1 = spark.createDataFrame([Row(a=1, b=2, c=3), Row(a=1, b=3, c=3)])
df2 = spark.createDataFrame([Row(a=1, b=3, c=3), Row(a=1, b=2, c=3)])

assert_frame_equal_with_sort(df1.toPandas(), df2.toPandas(), ['b'])
Tanjin
  • I'm looking more for how to use something like this with pytest; I already have the code to compare the DataFrames – kfoley Feb 09 '19 at 17:04

Just use the pandas.DataFrame.equals method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.equals.html

For example

assert df1.equals(df2)

assert can be used with anything that returns a boolean, so you can write any custom function to compare two objects as long as it returns a boolean. However, in this case there is no need for a custom function, as pandas already provides one.
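Note that equals compares values positionally, so row order matters. For example:

```python
import pandas as pd

df1 = pd.DataFrame({"id": [1, 2, 3]})
df2 = pd.DataFrame({"id": [1, 2, 3]})
df3 = pd.DataFrame({"id": [3, 2, 1]})

print(df1.equals(df2))  # True  -- same values, same order
print(df1.equals(df3))  # False -- equals is order-sensitive
```

If order shouldn't matter, you'd have to sort both frames before calling it.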

Arran Duff
  • The issue with this approach is it doesn't seem to work with `pytest_assertrepr_compare` which I'd also like to take advantage of. That acts as a hook that receives the operator, left, and right elements and lets you define how the failure should show in the log. I'll add that detail to my question. – kfoley Feb 09 '19 at 18:00

You can use one of pytest's hooks, particularly pytest_assertrepr_compare. There you can define what you want to compare and how; the docs are pretty good and include examples. Best of luck. :)
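A rough sketch of the hook in a conftest.py (the report text here is purely illustrative):

```python
# conftest.py
import pandas as pd


def pytest_assertrepr_compare(op, left, right):
    # Customize the failure report only for DataFrame == DataFrame asserts;
    # returning None falls back to pytest's default representation
    if op == "==" and isinstance(left, pd.DataFrame) and isinstance(right, pd.DataFrame):
        return [
            "Comparing DataFrames:",
            f"   left shape:  {left.shape}",
            f"   right shape: {right.shape}",
        ]
```

Keep in mind this hook only controls how a failed assert is rendered, not whether the assert passes.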

Daniel
  • I originally thought that was the answer as well, but my understanding is that it's simply used to drive how the data is represented in the log once the `assert` call fails; it has no control over how the `assert` is handled. – kfoley Feb 09 '19 at 17:27