
I know that a DataFrame in PySpark is split into partitions, and when I apply a function (a UDF) to one column, the different partitions apply the same function in parallel.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = sqlCtx.createDataFrame(
    [
        (1, 1, 'A', '2017-01-01'),
        (2, 3, 'B', '2017-01-02'),
        (3, 5, 'A', '2017-01-03'),
        (4, 7, 'B', '2017-01-04')
    ],
    ('index', 'X', 'label', 'date')
)

# Collect the labels on the driver: ['A', 'B', 'A', 'B']
data = df.rdd.map(lambda x: x['label']).collect()

def ad(x):
    # Pop the next label from the driver-side list and lowercase it
    return data.pop(0).lower()

AD = F.udf(ad, StringType())
df.withColumn('station', AD('label')).select('station').rdd.flatMap(lambda x: x).collect()

Here is the output:

['a', 'a', 'a', 'a']

but it should be:

['a', 'b', 'a', 'b']

And the strangest thing is that

data

didn't change at all on the driver, even though the function calls

data.pop(0)
Zichu Lee
  • This question sounds interesting. Could you please add a [minimal reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples)? – cronoik Jul 03 '19 at 14:51
  • Hey there, I just added some sample data you can play with. – Zichu Lee Jul 08 '19 at 20:32
  • I get the expected output: `['a', 'b', 'a', 'b']` with Spark 2.4 and Python 3. `data` also changes after calling `data.pop(0)`. Could you please check your question again? – cronoik Jul 10 '19 at 14:37
  • @cronoik Try increasing the number of partitions, and then you will get the same result as mine. – Zichu Lee Jul 11 '19 at 21:28
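
Indeed, a minimal sketch (reusing df, data and AD from the snippet above; exact row-to-partition placement may vary by Spark version) reproduces both outputs by pinning the partition count:

# One partition: a single copy of `data` is consumed in row order.
df.coalesce(1).withColumn('station', AD('label')) \
    .select('station').rdd.flatMap(lambda x: x).collect()
# -> ['a', 'b', 'a', 'b']

# Four partitions, roughly one row each: every task deserializes its own
# copy of `data` and pops *its* first element, which is always 'A'.
df.repartition(4).withColumn('station', AD('label')) \
    .select('station').rdd.flatMap(lambda x: x).collect()
# -> ['a', 'a', 'a', 'a']

# The driver-side list is untouched by either run.
print(data)  # ['A', 'B', 'A', 'B']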

1 Answer


Well, it turns out that when the number of partitions increases, the function is applied on each partition with the same

data

which means the list is effectively deep-copied per task, and the driver's copy will never change.

Every time we use F.udf, Spark serializes (pickles) every variable the function closes over and ships an independent copy to each task, so mutations made inside the UDF only affect that task's copy. With four partitions holding one row each, every copy of data pops its own first element, 'A', which is why the output is ['a', 'a', 'a', 'a'].
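
For this toy example the expected output is just the lowercased label, so a minimal partition-safe sketch can use a built-in column function instead of a mutable list captured from the driver:

from pyspark.sql import functions as F

# No shared state: the lowercasing runs per row on the executors.
df.withColumn('station', F.lower('label')) \
    .select('station').rdd.flatMap(lambda x: x).collect()
# -> ['a', 'b', 'a', 'b']

If you genuinely need positional state, attach it to the rows themselves (for example by joining precomputed values on the existing index column) rather than capturing a mutable list in the UDF's closure.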

Zichu Lee