I have a dataframe column containing URL-encoded strings such as the ones below.

I would like to do something like this:

someDF.withColumn('newcol', URLDecoder.decode( col("mystring"), "utf-8" ))
someDF.show()
| mystring            | newcol            |
|---------------------|-------------------|
| ThisIs%201rstString | ThisIs 1rstString |
| This%20is%3Ethisone | This is>thisone   |
| and%20so%20one      | and so one        |

How should I do such a thing? I guess a map function is around the corner, but I can't figure out how to use it.

Note: this is a sample, and it is not an option to create multiple replace statements, as there are many other encoded characters and the list may vary. I'd like to use a simple, reliable method to do so.

Kiwy

2 Answers


You can try the SparkSQL builtin function reflect:

reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.

df = spark.createDataFrame([(e,) for e in ["ThisIs%201rstString", "This%20is%3Ethisone", "and%20so%20one"]], ["mystring"])

df.selectExpr("*", "reflect('java.net.URLDecoder','decode', mystring, 'utf-8') as newcol").show()

+-------------------+-----------------+
|           mystring|           newcol|
+-------------------+-----------------+
|ThisIs%201rstString|ThisIs 1rstString|
|This%20is%3Ethisone|  This is>thisone|
|     and%20so%20one|       and so one|
+-------------------+-----------------+

Note: the above is Python code; you should be able to do the same in Scala.
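If you want to sanity-check the expected decoded values outside Spark, Python's standard library `urllib.parse.unquote` performs the same percent-decoding as `java.net.URLDecoder.decode` for these sample inputs (one caveat: `URLDecoder` also turns `+` into a space, which `unquote` does not; `unquote_plus` matches that behavior):

```python
from urllib.parse import unquote

samples = ["ThisIs%201rstString", "This%20is%3Ethisone", "and%20so%20one"]
for s in samples:
    # percent-decode each sample string
    print(unquote(s))
# ThisIs 1rstString
# This is>thisone
# and so one
```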

jxc
  • Is there a major difference compared with defining a UDF? – Kiwy Sep 10 '20 at 08:41
  • In short, the builtin functions are simple and have less overhead; a UDF is not vectorized and usually less performant. This link https://stackoverflow.com/questions/38296609 lists some major points compared to PySpark UDFs, but it also applies to Scala/Java. – jxc Sep 10 '20 at 11:52
  • OK, interesting. I never use `pyspark`, for several reasons: one, because it's `python`; two, because I know there's a performance hit with `pyspark`, though it has been reduced in Spark 3. In that case I would rather use pure Scala, but the `reflect` function seems to only be available in SQL. – Kiwy Sep 10 '20 at 13:33
  • Yeah, it's not directly available with pyspark (not sure about Scala/Java), but you can access it using `expr("reflect(......)")`. – jxc Sep 10 '20 at 13:37
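Pulling the comments together, a minimal Scala sketch of jxc's `expr` suggestion might look like the following. This assumes a `SparkSession` and a dataframe `df` with a `mystring` column as in the answer above; `expr` lets you use the SQL-only `reflect` function from the Scala DataFrame API:

```scala
import org.apache.spark.sql.functions.expr

// Call java.net.URLDecoder.decode via the SQL builtin `reflect`,
// avoiding the overhead of a user-defined function.
df.withColumn(
    "newcol",
    expr("reflect('java.net.URLDecoder', 'decode', mystring, 'utf-8')"))
  .show()
```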

Create a UDF that performs the work:

import java.net.URLDecoder
import org.apache.spark.sql.functions.udf
import spark.implicits._  // enables the 'mystring symbol syntax

def decode(in: String) = URLDecoder.decode(in, "utf-8")
val decode_udf = udf(decode(_))
df.withColumn("newcol", decode_udf('mystring)).show()

prints the expected result.

werner
  • Nice, it does work, but it adds a huge performance impact; I will try to look for something a bit less heavy. – Kiwy Sep 11 '20 at 09:18