I have a unit test (using pytest) that runs my PySpark code, with the usual conftest.py that creates the SQLContext fixture.
I want uuid.uuid4() to return the same value everywhere, so I patched uuid4 in my test.
If I call uuid.uuid4() from the test function itself, all is good.
However, when the PySpark job runs, it also calls uuid4, but that call is not patched:
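The direct-call behavior I rely on can be sketched without Spark (the function name is illustrative, not my real test):

```python
import uuid
from unittest.mock import patch

# check_direct_call stands in for my real test function.
@patch.object(uuid, "uuid4", return_value="1-1-1-1")
def check_direct_call(mock_uuid4):
    # Inside the decorated function, uuid.uuid4 is the mock.
    return str(uuid.uuid4())

print(check_direct_call())  # prints "1-1-1-1"
```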
My PySpark function (simplified):
import uuid

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.sql.functions import udf


def create_uuid_if_needed(current, prev):
    if current > prev:
        return str(uuid.uuid4())
    else:
        return None


def my_df_func(df):
    my_udf = udf(create_uuid_if_needed, T.StringType())
    my_window = Window.partitionBy(F.col(PARTITIONING_KEY)).orderBy(F.col(ORDER))
    # lag() is the window function, so over() belongs on it, not on the udf column
    return df.withColumn('new_col', my_udf(df.col, F.lag(df.col, 1).over(my_window)))
My test looks like this:
@patch.object(uuid, 'uuid4', return_value='1-1-1-1')
def test_add_activity_period_start_id(mock_uuid4, sql_context, input_fixture):
    input_df = sql_context.createDataFrame(input_fixture, [... schema...])
    good_uuid = str(uuid.uuid4())
    another_good_uuid = create_uuid_if_needed(2, 1)
    actual_df = my_df_func(input_df)
    ...
Both good_uuid and another_good_uuid get the mocked value '1-1-1-1', but the udf applied to the DataFrame still calls the unpatched uuid4.
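To rule Spark itself out, I reproduced what I suspect is the analogous situation with a plain subprocess: a patch applied in my test process is invisible to code running in a separately launched Python interpreter (all names here are illustrative):

```python
import subprocess
import sys
import uuid
from unittest.mock import patch

with patch.object(uuid, "uuid4", return_value="1-1-1-1"):
    # Same process: the mock is visible.
    local_value = str(uuid.uuid4())

    # Fresh interpreter: it imports the real uuid module from scratch,
    # so the patch applied in this process never reaches it.
    remote_value = subprocess.run(
        [sys.executable, "-c", "import uuid; print(uuid.uuid4())"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

print(local_value)   # "1-1-1-1"
print(remote_value)  # a real random UUID
```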
What is wrong here? Is it something the udf() function is doing?
Thanks!