I have a 243 MB dataset and need to add a row-number column to my DataFrame. I tried the following approaches:
import org.apache.spark.sql.functions._

df.withColumn("Rownumber", monotonically_increasing_id())
With this, the row numbers are correct up to row 248352; after that they suddenly jump to values like 8589934592.
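(For context, this jump matches the documented behavior of monotonically_increasing_id: the IDs are guaranteed unique and increasing but not consecutive, because the partition index is packed into the upper bits and the per-partition row offset into the lower 33 bits. A minimal pure-Scala sketch of that encoding, with hypothetical names, so 8589934592 = 2^33 is simply the first row of the second partition:)

```scala
// Sketch of how monotonically_increasing_id composes its values:
// partition index in the upper bits, per-partition row offset in the lower 33 bits.
object IdSketch {
  def sparkStyleId(partitionId: Int, rowOffsetInPartition: Long): Long =
    (partitionId.toLong << 33) | rowOffsetInPartition

  def main(args: Array[String]): Unit = {
    // Rows in partition 0 count 0, 1, 2, ...
    println(sparkStyleId(0, 248351))   // 248351
    // The first row of partition 1 jumps to 2^33.
    println(sparkStyleId(1, 0))        // 8589934592
  }
}
```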
I also tried a window function:

df.createOrReplaceTempView("table")
val query = "SELECT *, ROW_NUMBER() OVER (ORDER BY Year) AS Rownumber FROM table"
val z = hiveContext.sql(query)
This gives the correct result, but it takes far too long, so I can't use it.
I hit the same performance problem with df.rdd.zipWithIndex.
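(For reference, the zipWithIndex variant I tried looks roughly like this; a sketch assuming a SparkSession named spark and the same df, since zipWithIndex works on the RDD and the result has to be rebuilt into a DataFrame:)

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Attach a global 0-based index via RDD zipWithIndex,
// then rebuild a DataFrame with the extra column appended.
val withIndex = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ idx)
}
val schema = StructType(
  df.schema.fields :+ StructField("Rownumber", LongType, nullable = false))
val indexed = spark.createDataFrame(withIndex, schema)
```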
What is the best way to solve this in Spark/Scala? I'm using Spark 2.3.0.