
I am new to Spark and I am trying to create a row number for a large dataset. I tried using the row_number window function, which works fine, but it is not performance efficient because I am not using a partitionBy clause.

E.g.:

val df = Seq(
  ("041", false),
  ("042", false),
  ("043", false)
).toDF("id", "flag")

The result should be:

val df = Seq(
  ("041", false, 1),
  ("042", false, 2),
  ("043", false, 3)
).toDF("id", "flag", "rownum")

Currently I am using:

df.withColumn("rownum",row_number().over(Window.orderBy($"id")))

Is there any other way to achieve this result without using window functions? I also tried monotonicallyIncreasingId and zipWithIndex.

drlol

1 Answer


You can use monotonicallyIncreasingId to get a rownum column:

val df2 = df.withColumn("rownum", monotonicallyIncreasingId)

Here the index starts at 0.

To start the index at 1, add 1 to monotonicallyIncreasingId:

val df2 = df.withColumn("rownum", monotonicallyIncreasingId + 1)

scala> val df2 = df.withColumn("rownum",monotonicallyIncreasingId)
df2: org.apache.spark.sql.DataFrame = [id: string, flag: boolean, rownum: bigint]

scala> df2.show
+---+-----+------+
| id| flag|rownum|
+---+-----+------+
|041|false|     0|
|042|false|     1|
|043|false|     2|
+---+-----+------+


Rajat Mishra
  • "monotonicallyIncreasingId" generates random numbers if we run multiple times, there is no surity of occurrence that generation will be from 0 or 1. – drlol Feb 16 '17 at 13:16
  • Another way is to convert the DataFrame to an RDD and use the zipWithIndex function (see the sketch after these comments): http://stackoverflow.com/questions/23939153/how-to-assign-unique-contiguous-numbers-to-elements-in-a-spark-rdd – Rajat Mishra Feb 16 '17 at 13:32
  • https://issues.apache.org/jira/browse/SPARK-3098 : read this; this also does not serve the purpose. Thanks! – drlol Feb 16 '17 at 13:55
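
For reference, a minimal sketch of the RDD zipWithIndex approach mentioned in the comments, assuming a spark-shell session (so spark and its implicits are in scope) and the example DataFrame from the question. Unlike monotonicallyIncreasingId, zipWithIndex produces contiguous indices, at the cost of a round trip through the RDD API:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// zipWithIndex numbers rows in their current order, so sort first if order matters.
val indexed = df.orderBy($"id").rdd.zipWithIndex.map {
  case (row, idx) => Row.fromSeq(row.toSeq :+ (idx + 1)) // +1 for a 1-based rownum
}

// Extend the original schema with the new rownum column.
val schema = StructType(df.schema.fields :+ StructField("rownum", LongType, nullable = false))
val dfWithRownum = spark.createDataFrame(indexed, schema)

dfWithRownum.show()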