
I know that a DataFrame in PySpark is split into partitions, and when I apply a function (a UDF) to one column, the different partitions apply the same function in parallel.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)

# Collect the labels on the driver: ['A', 'B', 'A', 'B']
data = df.rdd.map(lambda x: x['label']).collect()

def ad(x):
    # Pop the next label from the driver-side list and lowercase it
    return data.pop(0).lower()

AD = F.udf(ad, StringType())
df.withColumn('station', AD('label')).select('station').rdd.flatMap(lambda x: x).collect()

Here is the output:

['a', 'a', 'a', 'a']

but it should be:

['a', 'b', 'a', 'b']

And the strangest thing is that

data

didn't change at all on the driver, even though the function calls

data.pop(0)
Zichu Lee
  • This question sounds interesting. Could you please add a [minimal reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples)? – cronoik Jul 03 '19 at 14:51
  • Hey there, I just added some sample data you can play with. – Zichu Lee Jul 08 '19 at 20:32
  • I get the expected output: `['a', 'b', 'a', 'b']` with Spark 2.4 and Python 3. `data` also changes after calling `data.pop(0)`. Could you please check your question again? – cronoik Jul 10 '19 at 14:37
  • @cronoik Try increasing the number of partitions, and then you will get the same result as mine. – Zichu Lee Jul 11 '19 at 21:28
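
Indeed, a minimal sketch (reusing df, data and AD from the snippet above; exact row-to-partition placement may vary by Spark version) reproduces both outputs by pinning the partition count:

# One partition: a single copy of `data` is consumed in row order.
df.coalesce(1).withColumn('station', AD('label')) \
    .select('station').rdd.flatMap(lambda x: x).collect()
# -> ['a', 'b', 'a', 'b']

# Four partitions, roughly one row each: every task deserializes its own
# copy of `data` and pops *its* first element, which is always 'A'.
df.repartition(4).withColumn('station', AD('label')) \
    .select('station').rdd.flatMap(lambda x: x).collect()
# -> ['a', 'a', 'a', 'a']

# The driver-side list is untouched by either run.
print(data)  # ['A', 'B', 'A', 'B']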

1 Answer


Well, it turns out that when the number of partitions increases, the function is applied on each partition with the same

data

which means the list is effectively deep-copied per task, and the driver's copy will never change.

Every time we use F.udf, Spark serializes (pickles) every variable the function closes over and ships an independent copy to each task, so mutations made inside the UDF only affect that task's copy. With four partitions holding one row each, every copy of data pops its own first element, 'A', which is why the output is ['a', 'a', 'a', 'a'].
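
For this toy example the expected output is just the lowercased label, so a minimal partition-safe sketch can use a built-in column function instead of a mutable list captured from the driver:

from pyspark.sql import functions as F

# No shared state: the lowercasing runs per row on the executors.
df.withColumn('station', F.lower('label')) \
    .select('station').rdd.flatMap(lambda x: x).collect()
# -> ['a', 'b', 'a', 'b']

If you genuinely need positional state, attach it to the rows themselves (for example by joining precomputed values on the existing index column) rather than capturing a mutable list in the UDF's closure.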

Zichu Lee