2

I'm starting to work with pandas UDFs in a PySpark Jupyter notebook running on an EMR cluster. Using this 'identity' pandas UDF, I'm getting the following error:

from pyspark.sql.functions import pandas_udf, PandasUDFType

# Input/output are both a pandas.DataFrame
@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def pudf(pdf):
    return pdf

df.filter(df.corp_cust=='LO').groupby('corp_cust').apply(pudf).show()

An error occurred while calling o388.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 113.0 failed 4 times, most recent failure: Lost task 0.3 in stage 113.0 (TID 1666, ip-10-23-226-64.us.scottsco.com, executor 1): java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:543)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNext(MessageChannelReader.java:58)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:132)

I can run df.filter(df.corp_cust=='LO').show() successfully, which makes me think things are 'breaking' somewhere in the translation from the pandas DataFrame back to the PySpark DataFrame.

This dataframe has a couple of StringType and DecimalType columns. I've also tried encoding the string columns to 'utf-8' within the UDF and get the same error.

Any suggestion on how to fix this?

Alfredo Lozano
  • 290
  • 1
  • 3
  • 15

1 Answer

2

This is apparently an issue[1] with pyarrow version 0.15 that causes pandas UDFs to throw this error. You can try changing versions by installing pyarrow 0.14.1 or lower:

  sc.install_pypi_package("pyarrow==0.14.1") 

[1]https://issues.apache.org/jira/browse/SPARK-29367
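
As a quick sanity check after the install (a minimal sketch, assuming sc is the notebook's SparkContext), you can confirm which pyarrow version the driver and the executors actually pick up:

    import pyarrow

    # Driver-side check
    print("driver pyarrow:", pyarrow.__version__)

    # Executor-side check: import pyarrow inside a task and collect the distinct versions seen
    versions = (
        sc.parallelize(range(sc.defaultParallelism))
        .map(lambda _: __import__("pyarrow").__version__)
        .distinct()
        .collect()
    )
    print("executor pyarrow:", versions)

If the executors still report 0.15.x, the downgrade hasn't reached the workers and the error will persist.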

K.S.
  • 2,846
  • 4
  • 24
  • 32
A.B
  • 20,110
  • 3
  • 37
  • 71
  • I have exactly the same issue, but with Spark 3.3.1. However, it does not accept this version of pyarrow (it needs at least 1.0.0), and with that version it is failing too. I am convinced it is a question of version, not of the code, since I am trying the code from the docs. Is there any chance you know the correct version for Spark 3.3? – mountrix Sep 01 '23 at 18:37