I am a complete newbie to Spark. I have an RDD with a column that contains the strings {'Fair', 'Good', 'Better', 'Best'}, and I want to create a function that maps them to {1, 2, 3, 4} using a dictionary. This is what I have so far, but it is not working: it fails with an error saying that a string object has no attribute 'items'. I am using an RDD, not a Pandas data frame. The function needs to work as a UDF so that I can change the original data frame, so it would be followed by spark.udf.register( , ).
Example data:
Name    Rank    Price
Red     Best    25.00
Blue    Fair    5.00
Yellow  Good    8.00
Green   Better  20.00
Black   Good    12.00
White   Fair    7.00
def rank(n):
    b = {"Fair": 1, "Good": 2, "Better": 3, "Best": 4}
    rep = {v : k for k, v in b.items()}
    return rep
spark.udf.register('RANK', rank)
df.select(
    '*',
    expr('RANK(Rank)')).show(5)
This works:
def rank(n):
    if n == "Fair":
        return 1
    elif n == "Good":
        return 2
    elif n == "Better":
        return 3
    elif n == "Best":
        return 4
    else:
        return n
spark.udf.register('RANK', rank)
But I want a simpler formula.
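For reference, the dictionary version I am aiming for would probably look something like the sketch below. It is only a sketch: the name RANK_MAP is made up for illustration, and it returns None for unknown labels instead of passing the original string through as the if/elif version does. The registration lines are left as comments because they assume an active SparkSession named spark.

```python
# Hypothetical lookup table: rank label -> numeric value (labels from the example data).
RANK_MAP = {"Fair": 1, "Good": 2, "Better": 3, "Best": 4}

def rank(n):
    # Look the label up directly; dict.get returns None for unrecognized labels.
    return RANK_MAP.get(n)

# Registration would then look like this (assumes an active SparkSession `spark`):
# from pyspark.sql.types import IntegerType
# spark.udf.register('RANK', rank, IntegerType())
```

The key difference from my failing attempt is that the function looks up n in the dictionary and returns a single value, rather than building and returning a whole dictionary.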