I'm getting a runtime error because I'm passing a whole column into a scalar function.
# https://stackoverflow.com/questions/44018182/how-to-do-string-transformation-in-pyspark
# https://stackoverflow.com/questions/5649407/how-to-convert-hexadecimal-string-to-bytes-in-python
# https://www.educba.com/pyspark-withcolumn/
# https://sparkbyexamples.com/pyspark/pyspark-sql-types-datatype-with-examples
# https://stackoverflow.com/questions/58504371/apply-function-to-all-values-in-array-column-pyspark
# https://stackoverflow.com/questions/41184116/convert-pyspark-dataframe-column-type-to-string-and-replace-the-square-brackets
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
def decodeRowKey(a):
    # Decode a hex string (e.g. "504C2F53514C") into UTF-8 text
    return bytearray.fromhex(a).decode("utf-8")
# BEGIN These lines work as expected and return PL/SQL and Groovy respectively
testDecodeRowKeyFunc = decodeRowKey("504C2F53514C")
testDecodeRowKeyFunc2 = decodeRowKey("47726F6F7679")
# END These lines work as expected and return PL/SQL and Groovy respectively
originalDF = sqlContext.table("tiobe_azure_backup")
originalRowKey = col("RowKey")
display(originalDF)
# BEGIN This doesn't work
# decodedColumn = originalRowKey.map(decodeRowKey)
# END This doesn't work
decodedColumn = decodeRowKey(originalRowKey.cast("string"))
transformedDF = originalDF.withColumn('RowKey2', decodedColumn)
display(transformedDF)
When I run this, I get:
TypeError: fromhex() argument must be str, not Column
How do I run my function decodeRowKey on every single element of the column originalRowKey so I can get a new column which I will name decodedColumn?
decodeRowKey is a transformation function. I want to do something like this pseudo-code:
newColumnOfStrings = map(decodeRowKey, oldColumnOfStrings)