
I have written a PySpark application that joins a large table with 10 lookup tables and then applies some transformations to the result using when clauses. Defining a DataFrame for each lookup table and joining them takes up most of the lines in the script. How do I unit test this? Do I use sc.parallelize for each of the lookup tables and for the final table and then check the transformations? How do you usually unit test Spark applications?
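For illustration, a stripped-down version of the kind of logic involved might look like this (the table and column names here, such as enrich_orders, orders, country_lookup, amount, and tier, are hypothetical stand-ins for whatever the real script uses):

```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def enrich_orders(orders: DataFrame, country_lookup: DataFrame) -> DataFrame:
    """Join the main table with a lookup table and apply a when-based transformation."""
    # Join the large table with one of the lookup tables.
    joined = orders.join(country_lookup, on="country_code", how="left")
    # Derive a new column with a when/otherwise expression.
    return joined.withColumn(
        "tier",
        F.when(F.col("amount") > 1000, "gold").otherwise("standard"),
    )
```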

Amardeep Flora

1 Answer


shuaiyuan's comment is good, and you should use an existing framework such as py.test for testing in Python. To answer the question more directly for DataFrames, I recommend that you don't use sc.parallelize, but instead use spark.createDataFrame to instantiate the DataFrames you pass into your function. You can then call df.collect() on the output and assert that the number of rows and the column values are what you expect.
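A minimal py.test sketch of that approach, assuming the join-and-transform logic has been factored into a function like the hypothetical enrich_orders shown in the question:

```python
import pytest
from pyspark.sql import SparkSession

# enrich_orders is the hypothetical function under test (sketched in the question above).
from myapp.transform import enrich_orders


@pytest.fixture(scope="session")
def spark():
    # Local SparkSession shared across the whole test session.
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_enrich_orders(spark):
    # Build small input DataFrames with spark.createDataFrame instead of sc.parallelize.
    orders = spark.createDataFrame(
        [("US", 1500.0), ("DE", 200.0)],
        ["country_code", "amount"],
    )
    country_lookup = spark.createDataFrame(
        [("US", "United States"), ("DE", "Germany")],
        ["country_code", "country_name"],
    )

    result = enrich_orders(orders, country_lookup).collect()

    # Assert on the row count and the derived column values.
    assert len(result) == 2
    tiers = {row["country_code"]: row["tier"] for row in result}
    assert tiers == {"US": "gold", "DE": "standard"}
```

The session-scoped fixture keeps SparkSession startup cost to once per test run, and each test only constructs the tiny inputs it needs.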

ktal90