
I need to add a column to my dataframe that increments by 1, but starting from 500. So the first row would be 500, the second one 501, etc. It doesn't make sense to use a UDF, since it can be executed on different workers, and I don't know of any function that takes a starting value as a parameter. I don't have anything I could sort my dataframe on either. Both row number and auto increment would start at 1 by default. I believe I could do it by transforming my df to an RDD and back to a df, but that seems like quite an ugly solution. Do you know of any existing function that would help me solve this at the dataframe level?

Thank you!

Grevioos

2 Answers


Since monotonically_increasing_id() isn't consecutive, you can use row_number() over a window ordered by monotonically_increasing_id() and add 499.

from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql.window import Window

# monotonically increasing (but not consecutive) ids, used only for ordering
df = df.withColumn("idx", monotonically_increasing_id())
# row_number() over that ordering is consecutive and starts at 1
w = Window().orderBy("idx")
df.withColumn("row_num", (499 + row_number().over(w))).show()
Cena
  • Using a window without partitions might have a [performance impact](https://stackoverflow.com/a/41316277/2129801) – werner Oct 05 '20 at 16:57
  • or using spark sql. `spark.sql('select row_number() over (order by idx) as row_num, * from df')` – Cena Oct 05 '20 at 17:01
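
For completeness, a runnable sketch of the SQL variant from the comment above (it assumes an active SparkSession named `spark`, that the `idx` column from the answer has already been added, and that the dataframe is registered as a temp view; the 499 offset is added to match the question):

# register the dataframe so it can be queried by name
df.createOrReplaceTempView("df")
spark.sql(
    "select row_number() over (order by idx) + 499 as row_num, * from df"
).show()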

I think you can use the monotonically_increasing_id function, which starts from 0, but you can start from a custom offset by adding a constant value to each id:

offset = start_offset + monotonically_increasing_id()
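
In dataframe terms, that would look something like the following minimal sketch (the column name `id` is illustrative, and 500 stands in for the desired starting value):

from pyspark.sql.functions import monotonically_increasing_id

# ids start at 0, so adding 500 shifts the first id to 500;
# note the ids are monotonically increasing but not consecutive
df = df.withColumn("id", monotonically_increasing_id() + 500)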
Hussein Awala
  • `monotonically_increasing_id` does not take any argument, and simply adding 500 should do the work: `monotonically_increasing_id() + 500` – Steven Oct 05 '20 at 12:51
  • Thank you, however, monotonically_increasing_id is not consecutive, so it works only for the first few rows. – Grevioos Oct 05 '20 at 15:53