
I have a dataframe and I need to get the row number/index of a specific row. I would like to add a new column that combines the Letter with the row number/index, e.g. "A - 1", "B - 2".

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

with output

+------+---------+
|Letter|distances|
+------+---------+
|     A|       20|
|     B|       30|
|     D|       80|
+------+---------+

I would like the new output to be something like this:

+------+---------+-----+
|Letter|distances|index|
+------+---------+-----+
|     A|       20|A - 1|
|     B|       30|B - 2|
|     D|       80|D - 3|
+------+---------+-----+

This is the function I have been working on:

def cate(letter):
    return letter + " - " + #index
a.withColumn("index", cate(a["Letter"])).show()
Bryce Ramgovind
  • Perhaps [this](https://stackoverflow.com/a/35948427/3433323) can help? (create ids and then pass the row id as argument to your function `cate`) – mkaran Dec 21 '17 at 10:29
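
For reference, here is a minimal sketch of the approach mkaran's link describes: use RDD.zipWithIndex to attach a stable, consecutive row id and build the combined column from it (the column name row_id is illustrative, not from the question):

from pyspark.sql.functions import concat, col, lit

# zipWithIndex yields (Row, 0-based index) pairs; shift the index to
# 1-based so the result matches "A - 1", "B - 2", ...
indexed = a.rdd.zipWithIndex() \
    .map(lambda pair: pair[0] + (pair[1] + 1,)) \
    .toDF(a.columns + ["row_id"])

indexed.withColumn(
    "index", concat(col("Letter"), lit(" - "), col("row_id").cast("string"))
).drop("row_id").show()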

2 Answers


Since you want to achieve the result using a UDF (only), let's try this:

from pyspark.sql.functions import udf, monotonically_increasing_id
from pyspark.sql.types import StringType

# sample data
a = sqlContext.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])

# UDF that joins the letter with the generated id
def cate(letter, idx):
    return letter + " - " + str(idx)
cate_udf = udf(cate, StringType())

# attach an increasing (but not necessarily consecutive) id,
# build the combined column, then drop the helper column
a = a.withColumn("temp_index", monotonically_increasing_id())
a = a.\
    withColumn("index", cate_udf(a.Letter, a.temp_index)).\
    drop("temp_index")
a.show()

Output is:

+------+---------+--------------+
|Letter|distances|         index|
+------+---------+--------------+
|     A|       20|         A - 0|
|     B|       30|B - 8589934592|
|     D|       80|D - 8589934593|
+------+---------+--------------+
Prem
  • monotonically_increasing_id() doesn't give you the sequence starting from a given value, rather it gives a random sequence. – Bala Dec 22 '17 at 13:20
  • @Bala - yeah, true... but it's increasing, so for the OP's use case I think it can be used as a sequential index. – Prem Dec 23 '17 at 10:12
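
Picking up on the comment thread: if a strictly consecutive 1, 2, 3, ... index is required (as in the question's expected output), one common workaround, sketched here as an addition rather than part of the answer, is to materialize the monotonic id and take row_number() over a window ordered by it. Note that the unpartitioned window funnels all rows through a single partition.

from pyspark.sql import Window
from pyspark.sql.functions import concat, col, lit, monotonically_increasing_id, row_number

# materialize the monotonic id, then convert it into a consecutive
# 1-based sequence with row_number() over a global window
a = a.withColumn("mid", monotonically_increasing_id())
w = Window.orderBy("mid")
a.withColumn(
    "index", concat(col("Letter"), lit(" - "), row_number().over(w).cast("string"))
).drop("mid").show()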

This should work:

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)], ["Letter", "distances"])
df.createOrReplaceTempView("df")

spark.sql("select concat(Letter,' - ',row_number() over (order by Letter)) as num, * from df").show()

+-----+------+---------+
|  num|Letter|distances|
+-----+------+---------+
|A - 1|     A|       20|
|B - 2|     B|       30|
|D - 3|     D|       80|
+-----+------+---------+
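
For completeness, the same query can also be expressed with the DataFrame API instead of SQL; this is an equivalent sketch, not taken from the answer, and it appends num as the last column rather than the first:

from pyspark.sql import Window
from pyspark.sql.functions import concat, col, lit, row_number

# row_number() over a window ordered by Letter reproduces the
# 1-based sequence from the SQL version
w = Window.orderBy("Letter")
df.withColumn(
    "num", concat(col("Letter"), lit(" - "), row_number().over(w).cast("string"))
).show()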
Bala