I have a PySpark DataFrame (called GPS) and I want to get its columns' data as lists, with each row of a column becoming an element of the list, so I used the following list comprehension:
ls = [x.GPS_COORDINATES for x in GPS.collect()]
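For reference, here is a minimal sketch of the kind of DataFrame I am working with (only the column name is real; the sample pipe-separated values are placeholders I made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample data; only the GPS_COORDINATES column name matches my real DataFrame
GPS = spark.createDataFrame(
    [("48.85|2.35",), ("40.71|-74.00",)],
    ["GPS_COORDINATES"],
)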
The list comprehension works as expected, but when I try to wrap this logic in a UDF and apply it to the whole DataFrame, as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
import pandas as pd

# same logic as the list comprehension above, but applied to a column
def col_pipe_duplicater(col):
    ls = [x.GPS_COORDINATES for x in col.collect()]
    return ls

pipe_remover_udf = udf(col_pipe_duplicater)  # , ArrayType(StringType()))

(
    GPS.select('GPS_COORDINATES',
               GPS.withColumn('New_col', pipe_remover_udf('GPS_COORDINATES')))
)
I get the following error:
Invalid argument, not a string or column: DataFrame[GPS_COORDINATES: string, New_col: string] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Any idea how to debug this? (If it helps, I am using the Jupyter Docker Stack (pyspark-notebook) with Spark 2.3.1 on a MacBook Pro.)
Thanks a lot,