
I have a dataset as below

+---------+
| column1 |
+---------+
| ABC     |
+---------+
| DEF     |
+---------+
| GHI     |
+---------+
| JKL     |
+---------+
| MNO     |
+---------+

Now if I have to get the 4th row's column value, which is JKL, is there any way to get it directly? I normally do it as below:

String dataTemp = df.select("column1").collectAsList().get(3).getAs("column1").toString();

But I don't want to collect the dataset as a list every time, which can cause issues when dealing with large datasets.

John Humanyun

2 Answers


With "take", only a limited number of rows is collected to the driver; in Scala:

val fourthRow = df.select("column1").take(4).last
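
For illustration, extracting the value itself from the returned Row (a minimal sketch, assuming column1 holds strings):

// take(4) brings only the first 4 rows to the driver; .last is the 4th row
val fourthValue = df.select("column1").take(4).last.getString(0) // "JKL"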

If the row number is large, switching to the RDD API is possible:

// zipWithIndex is 0-based, so the 4th row has index 3
val fourthRow = df.rdd.zipWithIndex().filter(_._2 == 3).keys.collect().head
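
Wrapped as a hypothetical helper (the name rowAt is illustrative, and the index is 0-based):

import org.apache.spark.sql.{DataFrame, Row}

// Scans the dataset once and returns only the row at the given 0-based index,
// without collecting the whole dataset to the driver
def rowAt(df: DataFrame, index: Long): Row =
  df.rdd.zipWithIndex().filter(_._2 == index).keys.collect().head

rowAt(df, 3) // the 4th row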
pasha701

Use row_number to assign each row an index and then select the row with rn = 4:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}
import spark.implicits._ // for the $"..." column syntax, assuming `spark` is the SparkSession

val row = df.withColumn("rn", row_number().over(Window.orderBy(lit(1))))
            .filter("rn = 4")
            .select($"column1")
            .first
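
To pull the raw value out of the returned Row (assuming column1 is a string column):

val value = row.getString(0) // "JKL"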
blackbishop
  • Indeed, this seems to me the only correct way you can achieve this without using collect(). – RudyVerboven Dec 31 '19 at 13:35
  • A message is displayed for such an approach: "No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation." The question mentions a large dataset; maybe such an approach will lead to OOM. – pasha701 Dec 31 '19 at 14:30
  • @pasha701 check [this](https://stackoverflow.com/questions/41313488/avoid-performance-impact-of-a-single-partition-mode-in-spark-window-functions). – blackbishop Dec 31 '19 at 14:46
  • I think this works, but it is no better than collecting the data; without partitioning, this will move all the data to 1 machine... – Raphael Roth Dec 31 '19 at 17:24