I have a PySpark DataFrame (called GPS) and I want to get its columns' data as lists, with each row of a column becoming an element of the list, so I used the following list comprehension:
ls = [x.GPS_COORDINATES for x in GPS.collect()]
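For reference, here is a minimal sketch of the kind of DataFrame I am working with (only the column name is real; the sample pipe-separated values are placeholders I made up):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Made-up sample data; only the GPS_COORDINATES column name matches my real DataFrame
GPS = spark.createDataFrame(
    [("48.85|2.35",), ("40.71|-74.00",)],
    ["GPS_COORDINATES"],
)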
The list comprehension works as expected, but when I try to wrap this logic in a UDF and apply it to the whole DataFrame, as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
import pandas as pd

# same logic as the list comprehension above, but applied to a column
def col_pipe_duplicater(col):
    ls = [x.GPS_COORDINATES for x in col.collect()]
    return ls

pipe_remover_udf = udf(col_pipe_duplicater)  # , ArrayType(StringType()))

(
    GPS.select('GPS_COORDINATES',
               GPS.withColumn('New_col', pipe_remover_udf('GPS_COORDINATES')))
)
I get the following error:
Invalid argument, not a string or column: DataFrame[GPS_COORDINATES: string, New_col: string] of type <class 'pyspark.sql.dataframe.DataFrame'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Any idea how to debug this? (If it helps, I am using the Jupyter Docker Stack (pyspark-notebook) with Spark 2.3.1 on a MacBook Pro.)
Thanks a lot,