I am using PySpark 2.4 and trying to write a udf that takes the values of columns id1 and id2, concatenates them, and returns the reversed string.

For example, my data looks like:

+---+---+
|id1|id2|
+---+---+
|  a|one|
|  b|two|
+---+---+

The corresponding code is:

df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])

The expected result is:

+---+---+----+
|id1|id2| val|
+---+---+----+
|  a|one|enoa|
|  b|two|owtb|
+---+---+----+

My code is:

@udf(string)
def reverse_value(value):
  return value[::-1]

df.withColumn('val', reverse_value(lit('id1' + 'id2')))

The error I get is:

TypeError: Invalid argument, not a string or column: <function 
reverse_value at 0x0000010E6D860B70> of type <class 'function'>. For
column literals, use 'lit', 'array', 'struct' or 'create_map'
function.
SkyOne

2 Answers

The call should be:

from pyspark.sql.functions import col, concat

df.withColumn('val', reverse_value(concat(col('id1'), col('id2'))))

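Explanation: 'id1' + 'id2' is ordinary Python string concatenation, so lit('id1' + 'id2') creates a constant column holding the literal string 'id1id2' instead of the values of the two columns. To join the columns row by row, use concat with col.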
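
As a rough plain-Python illustration of the difference (no Spark involved; the rows list is a hypothetical stand-in for the DataFrame, and the function body is copied from the question):

```python
# Plain-Python sketch, not PySpark: it only mimics what each expression feeds to the udf.

def reverse_value(value):
    # Same body as the udf in the question
    return value[::-1]

# Hypothetical stand-in for the two-row DataFrame (id1, id2)
rows = [("a", "one"), ("b", "two")]

# 'id1' + 'id2' is evaluated by Python first, giving one constant string
literal = "id1" + "id2"
from_literal = [reverse_value(literal) for _ in rows]

# Concatenating the actual column values row by row, as concat(col(...), col(...)) does
from_columns = [reverse_value(id1 + id2) for id1, id2 in rows]

print(from_literal)   # ['2di1di', '2di1di']
print(from_columns)   # ['enoa', 'owtb']
```

Every row gets the same value in the first case because the literal never refers to the row's data.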

Additionally, it is not clear whether the argument passed to udf is correct. It should be one of the following:

from pyspark.sql.functions import udf

@udf
def reverse_value(value):
    ...

or

@udf("string")
def reverse_value(value):
    ...

or

from pyspark.sql.types import StringType

@udf(StringType())
def reverse_value(value):
    ...

The stack trace also suggests that there are other problems in your code that cannot be reproduced with the snippet you've shared: reverse_value seems to return a function.

The answer by @user11669673 explains what's wrong with your code and how to fix the udf. However, you don't need a udf for this.

You will achieve much better performance by using pyspark.sql.functions.reverse:

from pyspark.sql.functions import col, concat, reverse
df.withColumn("val", concat(reverse(col("id2")), col("id1"))).show()
#+---+---+----+
#|id1|id2| val|
#+---+---+----+
#|  a|one|enoa|
#|  b|two|owtb|
#+---+---+----+
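
One caveat, sketched in plain Python rather than Spark: concat(reverse(col("id2")), col("id1")) matches the reversed concatenation only when id1 reads the same backwards, as the single-character values in the example do. Assuming id1 could hold longer strings, reverse(concat(col("id1"), col("id2"))) would cover the general case. The helper names below are hypothetical stand-ins for the Spark expressions, not PySpark functions.

```python
# Plain-Python stand-ins for the two Spark expressions (illustration only).

def reverse(s):
    # Mimics pyspark.sql.functions.reverse on a single string value
    return s[::-1]

def answer_expr(id1, id2):
    # Mirrors concat(reverse(col("id2")), col("id1"))
    return reverse(id2) + id1

def general_expr(id1, id2):
    # Mirrors reverse(concat(col("id1"), col("id2")))
    return reverse(id1 + id2)

# Single-character id1: both expressions agree
print(answer_expr("a", "one"), general_expr("a", "one"))    # enoa enoa

# Two-character id1 (hypothetical input): they differ
print(answer_expr("ab", "one"), general_expr("ab", "one"))  # enoab enoba
```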
pault