
I'm getting a runtime error because I'm trying to pass a whole Spark Column into a scalar Python function.

# https://stackoverflow.com/questions/44018182/how-to-do-string-transformation-in-pyspark
# https://stackoverflow.com/questions/5649407/how-to-convert-hexadecimal-string-to-bytes-in-python
# https://www.educba.com/pyspark-withcolumn/
# https://sparkbyexamples.com/pyspark/pyspark-sql-types-datatype-with-examples
# https://stackoverflow.com/questions/58504371/apply-function-to-all-values-in-array-column-pyspark
# https://stackoverflow.com/questions/41184116/convert-pyspark-dataframe-column-type-to-string-and-replace-the-square-brackets
from pyspark.sql.functions import col
from pyspark.sql.types import StringType

def decodeRowKey(a):
    return bytearray.fromhex(a).decode("utf-8")

# BEGIN These lines work as expected and return PL/SQL and Groovy respectively
testDecodeRowKeyFunc = decodeRowKey("504C2F53514C")
testDecodeRowKeyFunc2 = decodeRowKey("47726F6F7679")
# END   These lines work as expected and return PL/SQL and Groovy respectively

originalDF = sqlContext.table("tiobe_azure_backup")

originalRowKey = col("RowKey")
display(originalDF)

# BEGIN This doesn't work
# decodedColumn = originalRowKey.map(decodeRowKey)
# END   This doesn't work
decodedColumn = decodeRowKey(originalRowKey.cast("string"))  # this line raises the TypeError shown below

transformedDF = originalDF.withColumn('RowKey2', decodedColumn)
display(transformedDF)

When I run this, I get the following error:

TypeError: fromhex() argument must be str, not Column

How do I run my function decodeRowKey on every single element of the column originalRowKey so I can get a new column which I will name decodedColumn?

decodeRowKey is a transformation function. I want to do something like this pseudo-code:
newColumnOfStrings = map(decodeRowKey, oldColumnOfStrings)
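Is a udf what I need here? A minimal sketch of what I'm imagining, assuming RowKey holds hex-encoded UTF-8 strings (decodeRowKey is the function defined above):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Wrap the plain Python function as a UDF so Spark applies it to each row's value
decodeRowKeyUDF = udf(decodeRowKey, StringType())
transformedDF = originalDF.withColumn('RowKey2', decodeRowKeyUDF(col('RowKey')))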


1 Answer


Your problem is that you're using the Python API instead of the native Spark APIs: decodeRowKey is an ordinary Python function, so it runs immediately on the driver and receives the Column object itself, not the string values inside it.

You need to use the unhex function and cast the result to a string, something like this:

from pyspark.sql.functions import col, unhex

transformedDF = originalDF.withColumn('RowKey2', unhex(col('RowKey')).cast("string"))
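For example, a minimal sketch using a couple of the hex values from the question (spark here is an assumed SparkSession; the show() output is illustrative):

from pyspark.sql.functions import col, unhex

df = spark.createDataFrame([("504C2F53514C",), ("47726F6F7679",)], ["RowKey"])

# unhex() turns the hex string into binary; casting binary to string decodes it as UTF-8
df.withColumn("RowKey2", unhex(col("RowKey")).cast("string")).show()
# +------------+-------+
# |      RowKey|RowKey2|
# +------------+-------+
# |504C2F53514C| PL/SQL|
# |47726F6F7679| Groovy|
# +------------+-------+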

If your column is an array of strings, then you need to add the transform function into the mix:

from pyspark.sql.functions import col, unhex, transform

transformedDF = originalDF.withColumn('RowKey2', 
  transform(col('RowKey'), lambda x: unhex(x).cast("string")))
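For example, a minimal sketch with a hypothetical array column (note the lambda form of transform needs Spark 3.1+; on older versions you can get the same effect with expr("transform(RowKey, x -> cast(unhex(x) as string))")):

from pyspark.sql.functions import col, unhex, transform

df = spark.createDataFrame([(["504C2F53514C", "47726F6F7679"],)], ["RowKey"])

# Decode every element of the array, producing an array of strings
df.withColumn("RowKey2",
  transform(col("RowKey"), lambda x: unhex(x).cast("string"))
).show(truncate=False)
# RowKey2 -> [PL/SQL, Groovy]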