
I have a unit test (using pytest) that exercises my PySpark code. I have the usual conftest.py that creates the SQLContext. I want uuid4 to return the same value in every case, so I patched uuid4 in my test. If I call uuid.uuid4() from the test function, all is good.

However, when I run the PySpark job, which also calls uuid4, the patch does not take effect.

My PySpark function (simplified):

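import uuid

from pyspark.sql import Window, functions as F, types as T
from pyspark.sql.functions import udf


def create_uuid_if_needed(current, prev):
    # Generate a new UUID only when the value increased relative to the previous row.
    if current > prev:
        return str(uuid.uuid4())
    else:
        return None


def my_df_func(df):
    my_udf = udf(create_uuid_if_needed, T.StringType())
    my_window = Window.partitionBy(F.col(PARTITIONING_KEY)).orderBy(F.col(ORDER))
    # The window applies to lag(); the UDF then compares the current and previous values.
    return df.withColumn('new_col', my_udf(df.col, F.lag(df.col, 1).over(my_window)))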

My test looks like this:

import uuid
from unittest.mock import patch


@patch.object(uuid, 'uuid4', return_value='1-1-1-1')
def test_add_activity_period_start_id(mocker, sql_context, input_fixture):
    input_df = sql_context.createDataFrame(input_fixture, [... schema...])
    good_uuid = str(uuid.uuid4())
    another_good_uuid = create_uuid_if_needed(2, 1)
    actual_df = my_df_func(input_df)
    ...

good_uuid gets the expected value, '1-1-1-1', and so does another_good_uuid, but the UDF version of the function, running on the DataFrame, still calls the unpatched uuid4.

What is wrong here? Is it something that the udf() function is doing? Thanks!
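
One likely explanation, stated here as an assumption rather than something confirmed in this thread: Spark runs Python UDFs in separate Python worker processes, while `patch.object` only replaces the attribute inside the process that runs the test. The sketch below does not use Spark at all. It uses a plain child interpreter (the helper name `uuid_in_child_process` is made up for illustration) to show that a patch applied in the parent is invisible to another process.

```python
import subprocess
import sys
import uuid
from unittest.mock import patch


def uuid_in_child_process():
    """Call uuid.uuid4() in a fresh Python interpreter and return what it prints."""
    result = subprocess.run(
        [sys.executable, "-c", "import uuid; print(uuid.uuid4())"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


with patch.object(uuid, "uuid4", return_value="1-1-1-1"):
    # Same process as the patch: the mock is used.
    in_parent = str(uuid.uuid4())
    # Different process: it imports its own, unpatched uuid module.
    in_child = uuid_in_child_process()

print("parent:", in_parent)
print("child: ", in_child)
```

If the same thing happens with the Spark workers, it would explain why both direct calls in the test see '1-1-1-1' while the UDF does not.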

ronhash
  • Can't you just return the string '1-1-1-1' instead of patching? Anyhow, you are using it as a function decorator here; instead, try using it as a test class decorator if you want the patch to work everywhere – sramalingam24 Apr 10 '19 at 14:27
  • I can't just return 1-1-1-1, as in prod it should generate a uuid. There is no test class here, just a test function (pytest). I could put it in conftest.py, but I simplified the code to make it clear – ronhash Apr 10 '19 at 14:44
  • Your code is going to behave differently in production than in development? This unit test seems useless. You could add an isUnitTest parameter to the create function with a default value of False and return the string '1-1-1-1' when unit testing, but it makes no sense – sramalingam24 Apr 10 '19 at 15:16
  • Disabling randomization in unit tests is pretty standard. Adding isUnitTest taints production code. This is one of the reasons why patching exists. Also, this is a stripped-down version of the test, to focus on the problem, not the whole test. – ronhash Apr 10 '19 at 16:26
  • I am not sure about disabling randomization, but you can control it; that is why there are tools like Faker. Anyway, here is how you mock a random uuid: https://stackoverflow.com/questions/41186818/how-to-generate-a-random-uuid-which-is-reproducible-with-a-seed-in-python – sramalingam24 Apr 12 '19 at 23:19
  • That's nice, but that's not what this question is about. The problem is mocking a function called from a Spark `udf` function; uuid is just one such example – ronhash Apr 14 '19 at 16:20
  • Have you figured out a solution for this use case? I'm running into a similar issue and am pretty sure that Spark's `udf` ignores the mock logic. – Minh Thai Dec 08 '20 at 09:28
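
A possible workaround, sketched under the assumption that the mock is lost because the UDF executes outside the test process: pass the ID generator in explicitly instead of patching `uuid.uuid4`. The factory `make_create_uuid_if_needed` and its `id_factory` parameter are hypothetical names introduced for this sketch; they are not part of the original code or of PySpark.

```python
import uuid


def make_create_uuid_if_needed(id_factory=None):
    """Build the UDF body around an injectable ID factory (hypothetical helper)."""
    if id_factory is None:
        # Production default: a real random UUID, as in the original function.
        def id_factory():
            return str(uuid.uuid4())

    def create_uuid_if_needed(current, prev):
        # Same comparison as the original function.
        if current > prev:
            return id_factory()
        return None

    return create_uuid_if_needed


# In a test: inject a deterministic factory instead of patching uuid.uuid4.
fixed = make_create_uuid_if_needed(lambda: "1-1-1-1")
fixed_result = fixed(2, 1)
skipped_result = fixed(1, 2)

# In production: omit the argument to get real UUIDs.
real_result = make_create_uuid_if_needed()(2, 1)
```

In the Spark code, the returned function would be wrapped the same way as before, for example `udf(make_create_uuid_if_needed(id_factory), T.StringType())`. The expectation, not verified here, is that the injected factory is serialized together with the function and so reaches the workers, which keeps production code free of test flags.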
