How to reverse and combine string columns in a spark dataframe?

Question

I am using PySpark 2.4 and I am trying to write a UDF that takes the values of columns id1 and id2 together and returns the reversed string.

For example, my data looks like:

+---+---+
|id1|id2|
+---+---+
|  a|one|
|  b|two|
+---+---+

The corresponding code is:

df = spark.createDataFrame([['a', 'one'], ['b', 'two']], ['id1', 'id2'])

The returned value should be

+---+---+----+
|id1|id2| val|
+---+---+----+
|  a|one|enoa|
|  b|two|owtb|
+---+---+----+

My code is:

@udf(string)
def reverse_value(value):
  return value[::-1]

df.withColumn('val', reverse_value(lit('id1' + 'id2')))

My errors are:

TypeError: Invalid argument, not a string or column: <function reverse_value at 0x0000010E6D860B70> of type <class 'function'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

user11669673 · Accepted Answer · 2019-06-19T10:26:43.427

Should be:

from pyspark.sql.functions import col, concat

df.withColumn('val', reverse_value(concat(col('id1'), col('id2'))))

Explanation:

lit creates a literal (constant) column, whereas you want to refer to the existing columns, which is what col does.
The columns then have to be concatenated using the concat function (see: Concatenate columns in Apache Spark DataFrame).
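
To see why the lit version cannot work, note that 'id1' + 'id2' is ordinary Python string concatenation, evaluated before Spark is involved. The sketch below uses plain Python only (no Spark session) and an undecorated copy of reverse_value; the variable names are illustrative and not part of the original code.

```python
def reverse_value(value):
    # Same body as the UDF in the question, without the Spark decorator.
    return value[::-1]

# Python joins the two string literals first, so lit('id1' + 'id2')
# would wrap the constant 'id1id2' rather than any column data.
literal_argument = 'id1' + 'id2'
reversed_literal = reverse_value(literal_argument)

# What the question asks for per row: join the values of id1 and id2, then reverse.
rows = [('a', 'one'), ('b', 'two')]
reversed_rows = [reverse_value(id1 + id2) for id1, id2 in rows]

print(literal_argument)   # id1id2
print(reversed_literal)   # 2di1di
print(reversed_rows)      # ['enoa', 'owtb']
```

So even with a working UDF, the lit call would hand the same constant to every row; concat(col('id1'), col('id2')) is what passes the per-row values.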

Additionally, it is not clear whether the argument passed to udf is correct. It should be one of the following:

from pyspark.sql.functions import udf

@udf
def reverse_value(value):
    ...

or

@udf("string")
def reverse_value(value):
    ...

or

from pyspark.sql.types import StringType

@udf(StringType())
def reverse_value(value):
    ...

Additionally, the stack trace suggests that there are other problems in your code that are not reproducible with the snippet you've shared: reverse_value seems to return a function.

score 1 · Answer 2 · answered Jun 19 '19 at 14:08

The answer by @user11669673 explains what's wrong with your code and how to fix the udf. However, you don't need a udf for this.

You will achieve much better performance by using pyspark.sql.functions.reverse:

from pyspark.sql.functions import col, concat, reverse
# Concatenate first, then reverse, so id1 values longer than one character are reversed too
df.withColumn("val", reverse(concat(col("id1"), col("id2")))).show()
#+---+---+----+
#|id1|id2| val|
#+---+---+----+
#|  a|one|enoa|
#|  b|two|owtb|
#+---+---+----+
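
One detail worth checking when using the built-in functions: reversing the concatenated string is not the same as prepending the reversed id2 to an unreversed id1, unless id1 is a single character as in the sample data. The following is a plain-Python sketch of the two string operations (no Spark needed); the helper names and the value 'ab' are hypothetical and do not come from the question.

```python
def reverse_joined(id1, id2):
    # Mirrors reverse(concat(id1, id2)): join first, then reverse everything.
    return (id1 + id2)[::-1]

def reversed_id2_then_id1(id1, id2):
    # Mirrors concat(reverse(id2), id1): only id2 is reversed.
    return id2[::-1] + id1

# Single-character id1 (as in the question's data): both give the same result.
print(reverse_joined('a', 'one'), reversed_id2_then_id1('a', 'one'))    # enoa enoa

# Hypothetical two-character id1: the results differ.
print(reverse_joined('ab', 'one'), reversed_id2_then_id1('ab', 'one'))  # enoba enoab
```

If id1 can hold more than one character, reversing the whole concatenation is the form that matches the behaviour of the UDF.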
