
I have a dataframe with only one column. I would like to split the string using a pandas_udf in PySpark. Hence, I have the following code:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('str')
def split_msg(string):
  msg_ = string.split(" ")
  return msg_

temp = temp.select("_c6").withColumn("decoded", 
split_msg(temp._c6)).drop("_c6")

But this is not working.

Any help is much appreciated!

I. A

1 Answer


Change your function to the following:

@pandas_udf('array<string>', PandasUDFType.SCALAR)
def split_msg(string):
    # `string` is a pandas Series here, so use the .str accessor
    msg_ = string.str.split(" ")
    return msg_

Basically, your function's returnType should be an array of StringType(), and the argument string is a pandas Series, so you need string.str.split(" ").
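
To see why the .str accessor is needed: inside a scalar pandas_udf the argument arrives as a pandas Series, not a plain Python string, so a bare .split(" ") would raise an AttributeError. A minimal pandas-only illustration (the sample values are made up, just to show the Series behaviour):

import pandas as pd

# inside a scalar pandas_udf the input is a Series, so string methods
# go through the .str accessor and are applied element-wise
s = pd.Series(["hello world", "foo bar baz"])
print(s.str.split(" "))   # a Series whose elements are lists of words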

However, if you just want to split the text, Spark's DataFrame API provides a built-in function, pyspark.sql.functions.split, which should be more efficient than using a pandas_udf.
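
For example, something along these lines should give the same decoded column without the overhead of a Python UDF (a minimal sketch; the toy dataframe and the _c6 column name stand in for the one in the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import split

spark = SparkSession.builder.getOrCreate()

# stand-in for the question's single-column dataframe
temp = spark.createDataFrame([("hello world",), ("foo bar baz",)], ["_c6"])

# split is evaluated natively by Spark, no round-trip through pandas needed
temp = temp.withColumn("decoded", split(temp._c6, " ")).drop("_c6")
temp.show(truncate=False)   # each row of decoded is an array of words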

jxc