Creating a function using dictionary to change column from strings to integers

Question

I am a complete newbie to Spark. I have an RDD with a column containing the strings 'Fair', 'Good', 'Better', and 'Best', and I want to create a function that uses a dictionary to change them to 1, 2, 3, and 4. This is what I have so far, but it is not working: it fails with an error saying a string object has no attribute 'items'. I am using an RDD, not a Pandas data frame. The function needs to work as a UDF so that I can change the original data frame, and it would be registered with spark.udf.register( , ).

Examples of data:
Name    Rank     Price
Red     Best     25.00
Blue    Fair     5.00
Yellow  Good     8.00
Green   Better   20.00
Black   Good     12.00
White   Fair     7.00

def rank(n):
    b = {"Fair": 1, "Good": 2, "Better": 3, "Best": 4}
    rep = {v : k for k, v in b.items()}
    return rep
    

spark.udf.register('RANK', rank)
df.select(
    '*',
    expr('RANK(Rank)')).show(5)

This works:

def rank(n):
    if n == "Fair":
        return 1
    elif n == "Good":
        return 2
    elif n == "Better":
        return 3
    elif n == "Best":
        return 4
    else:
        return n
spark.udf.register('RANK', rank)

But I want a simpler formula.

could you please share a small example input data to test potential code, also your expected output. — samkart, May 28 '21 at 04:43
I have a RDD with three columns: name, rank, and price. The rank column has the strings of fair, good, better, and best. I want these to become a column of integers from 1 to 4. I would use the function in a UDF to change the column. Does this make sense? — Mokonalove, May 28 '21 at 04:56
are you using dataframes or RDD transformations? that'll change the meaning significantly. in any case, could you share this detail of your rdd with sample values in your question? refer [here](https://stackoverflow.com/help/minimal-reproducible-example) or [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610) — samkart, May 28 '21 at 05:03
if you're using dataframes (spark also has dataframes, not all dataframe are pandas dataframes!), then you can easily implement a `pandas_udf()`. the function should return a value (the rank value) and should be able to ingest the rank (a value from the rank field). So, the function takes the character rank and returns integer rank - basic flow. BTW, your question seems to be hinting that you want to use `RDD` and not spark dataframes. — samkart, May 28 '21 at 07:47
you're actually using spark dataframe in your example stated above, not an RDD explicitly. — samkart, May 28 '21 at 12:04

score 0 · Answer 1 · answered May 28 '21 at 05:02

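from itertools import chain

from pyspark.sql.functions import col, create_map, lit

# Example dict, using the values from the question.
mapping = {"Fair": 1, "Good": 2, "Better": 3, "Best": 4}

# chain(*mapping.items()) flattens the dict to key, value, key, value, ... as create_map expects.
mapping_expr = create_map([lit(x) for x in chain(*mapping.items())])

# withColumn returns a new DataFrame, so keep the result; old_column is the source column (Rank in the question).
df = df.withColumn("new_column", mapping_expr.getItem(col("old_column")))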

where mapping is your dict. Don't call it list; that name is already used by the list class in Python.
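
If you would rather keep a UDF, as the question asks, the if/elif chain can probably be reduced to a single dictionary lookup. What follows is a minimal sketch, not a tested solution. It assumes the column is called Rank, as in the sample data. RANK_MAP, rank_to_int, add_rank_column and the output column rank_int are hypothetical names introduced for illustration. The UDF is registered under a name other than RANK only to avoid confusion with Spark SQL's built-in rank window function.

```python
# Hypothetical mapping name; the values come from the question.
RANK_MAP = {"Fair": 1, "Good": 2, "Better": 3, "Best": 4}


def rank_to_int(label):
    """Return the integer rank for a label, or None if it is not in the dict."""
    return RANK_MAP.get(label)


def add_rank_column(spark, df):
    """Register the UDF and append an integer column (assumes a 'Rank' column)."""
    # Imported here so the plain-Python part above can be tried without Spark.
    from pyspark.sql.functions import expr
    from pyspark.sql.types import IntegerType

    # Without an explicit return type, a registered Python UDF returns strings.
    spark.udf.register("rank_to_int", rank_to_int, IntegerType())
    return df.select("*", expr("rank_to_int(Rank)").alias("rank_int"))
```

Calling add_rank_column(spark, df).show(5) should then display the original columns plus rank_int. Unlike the if/elif version in the question, labels missing from the dict come back as None (null in Spark) instead of the original string, because the column is declared as an integer type.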
