I have some tests in my pytest suite that compare DataFrames with assert df1.collect() == df2.collect().

If I run the tests inside the PyCharm IDE, they pass; if I run them from the console, an assertion error is raised.

After some debugging, I found that when I run the tests from the console, the collected rows come back in a different order.

For example, if my DataFrame has two rows, this code passes in PyCharm but fails in the console:

assert df1.collect()[0] == df2.collect()[0]

And this one fails in PyCharm but passes in the console:

assert df1.collect()[1] == df2.collect()[0]

I've tried invoking pytest both with python3 -m pytest and with plain pytest. PyCharm and the console use the same venv.

Fran Arenas

1 Answer

To my knowledge, .collect() does not guarantee any order. Since the data is sent to the driver from possibly multiple executors, one executor may be faster than another, so the rows can arrive in a different order from run to run. Instead of comparing single elements, you should compare the lists as a whole, ignoring order, if possible.

E.g.

self.assertCountEqual(df1.collect(), df2.collect())  # inside a unittest.TestCase
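
assertCountEqual is a method of unittest.TestCase, so in a plain pytest function an equivalent check has to be written by hand. A minimal sketch follows, assuming the collected rows are hashable; plain tuples stand in for the results of df1.collect() and df2.collect() so the snippet runs without Spark, and the helper name assert_same_rows is hypothetical.

```python
from collections import Counter


def assert_same_rows(actual, expected):
    """Assert that two row lists hold the same rows, ignoring order."""
    # Counter keeps track of duplicates, so [a, a, b] does not match [a, b].
    assert Counter(actual) == Counter(expected)


# Stand-ins for df1.collect() and df2.collect(), returned in different orders.
rows_1 = [(1, "a"), (2, "b")]
rows_2 = [(2, "b"), (1, "a")]

assert rows_1 != rows_2           # plain list equality is order-sensitive
assert_same_rows(rows_1, rows_2)  # passes regardless of order
```

Comparing counts rather than sets keeps the check strict about duplicated rows, which matches what assertCountEqual does.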
Robert Kossendey
  • Thank you for the answer. In my case, I am using pytest, but I found this other question that applies assertCountEqual in pytest: https://stackoverflow.com/questions/41605889/does-pytest-have-an-assertitemsequal-assertcountequal-equivalent. Another solution would be to manually implement an O(n×n) comparison – Fran Arenas Jul 01 '22 at 12:51
  • Yes, both would work. If the answer helped you, an upvote and acceptance would be appreciated :) – Robert Kossendey Jul 01 '22 at 12:53
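
The manual O(n×n) comparison mentioned in the comments could look roughly like the sketch below. It relies only on ==, so it also works when rows are not hashable. The function name rows_match_unordered is hypothetical, and plain lists stand in for collected rows so the snippet runs without Spark.

```python
def rows_match_unordered(actual, expected):
    """Return True if both lists contain the same rows, in any order."""
    remaining = list(expected)  # copy so the caller's list is left untouched
    for row in actual:
        try:
            # list.remove scans linearly using ==, hence O(n*n) overall.
            remaining.remove(row)
        except ValueError:
            return False  # this row has no unmatched partner in expected
    return not remaining  # leftovers mean expected had extra rows


rows_1 = [[1, "a"], [2, "b"]]
rows_2 = [[2, "b"], [1, "a"]]

assert rows_match_unordered(rows_1, rows_2)
assert not rows_match_unordered(rows_1, [[1, "a"]])
```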