
I have a 243 MB dataset. I need to add a row number column to my DataFrame, and I tried the methods below:

import org.apache.spark.sql.functions._
df.withColumn("Rownumber", monotonically_increasing_id())

The row numbers go wrong after 248352 rows: from that point on, the values jump to 8589934592 and beyond.
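
For context, 8589934592 is 2^33: `monotonically_increasing_id()` puts the partition index in the upper 31 bits of the 64-bit ID, so the IDs are monotonic but not consecutive, and they jump at every partition boundary. A small sketch (assuming a SparkSession named `spark`) reproduces the jump:

import org.apache.spark.sql.functions.{min, max, monotonically_increasing_id, spark_partition_id}

// Two partitions: the first row of the second partition gets ID 2^33 = 8589934592.
val sample = spark.range(0L, 500000L).repartition(2)
sample.withColumn("id", monotonically_increasing_id())
  .withColumn("part", spark_partition_id())
  .groupBy("part")
  .agg(min("id"), max("id"))
  .show()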

I also tried:

df.createOrReplaceTempView("table")

val query = "select *, ROW_NUMBER() OVER (ORDER BY Year) as Rownumber from table"

val z = hiveContext.sql(query)

With this method I get the correct answer, but it takes far too long, so I can't use it.
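
For reference, the same window expressed through the DataFrame API behaves identically; since the window has no PARTITION BY, Spark moves every row into a single partition to produce the global ordering (it even logs a "No Partition Defined for Window operation" warning), which explains the slowness:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Equivalent to the SQL above: a global ROW_NUMBER() ordered by Year.
val w = Window.orderBy("Year")
val z = df.withColumn("Rownumber", row_number().over(w))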

df.rdd.zipWithIndex has the same problem.
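
For completeness, this is the usual way to wire `zipWithIndex` back into a DataFrame, kept here as a sketch (it assumes a SparkSession named `spark`; the column name `Rownumber` matches the examples above). It assigns consecutive indices without collapsing everything into one partition, at the cost of a round trip through the RDD API:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Pair each row with a consecutive 0-based index, shift it to 1-based,
// and rebuild the DataFrame with the extra column appended to the schema.
val withIdx = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1L))
}
val schema = StructType(df.schema.fields :+ StructField("Rownumber", LongType, nullable = false))
val indexed = spark.createDataFrame(withIdx, schema)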

What is the best way to solve this in Spark/Scala? I'm using Spark 2.3.0.

Aswathy
    Is it really 243 megabytes? You shouldn't need Spark at all. And `ROW_NUMBER()`, inefficient as it is, shouldn't be a performance issue with such a small amount of data. Neither should `rdd.zipWithIndex`. Still, you've enumerated all the available methods (and `monotonically_increasing_id` works as it's supposed to). – Alper t. Turker May 16 '18 at 13:22

0 Answers