
I have a dataset as below

+---------+
| column1 |
+---------+
| ABC     |
+---------+
| DEF     |
+---------+
| GHI     |
+---------+
| JKL     |
+---------+
| MNO     |
+---------+

Now if I have to get the 4th row's column value, which is JKL, is there any way to get it directly? I normally do it as below:

String dataTemp = df.select("column1").collectAsList().get(3).getAs("column1").toString();

But I don't want to collect the dataset as a list every time, which can cause issues when dealing with large datasets.

John Humanyun

2 Answers


With "take", only a limited number of rows is collected to the driver; in Scala:

val fourthRow = df.select("column1").take(4).last
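
For illustration, extracting the value itself from the returned Row (a minimal sketch, assuming column1 holds strings):

// take(4) brings only the first 4 rows to the driver; .last is the 4th row
val fourthValue = df.select("column1").take(4).last.getString(0) // "JKL"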

If the row number is large, switching to the RDD API is possible:

// zipWithIndex is 0-based, so the 4th row has index 3
val fourthRow = df.rdd.zipWithIndex().filter(_._2 == 3).keys.collect().head
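
Wrapped as a hypothetical helper (the name rowAt is illustrative, and the index is 0-based):

import org.apache.spark.sql.{DataFrame, Row}

// Scans the dataset once and returns only the row at the given 0-based index,
// without collecting the whole dataset to the driver
def rowAt(df: DataFrame, index: Long): Row =
  df.rdd.zipWithIndex().filter(_._2 == index).keys.collect().head

rowAt(df, 3) // the 4th row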
pasha701

Use row_number to assign each row an index and then select the row with rn = 4:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}
import spark.implicits._ // for the $"..." column syntax, assuming `spark` is the SparkSession

val row = df.withColumn("rn", row_number().over(Window.orderBy(lit(1))))
            .filter("rn = 4")
            .select($"column1")
            .first
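
To pull the raw value out of the returned Row (assuming column1 is a string column):

val value = row.getString(0) // "JKL"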
blackbishop
  • Indeed, this seems to me the only correct way you can achieve this without using collect(). – RudyVerboven Dec 31 '19 at 13:35
  • A message is displayed for such an approach: "No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation." The question mentions a large dataset; maybe such an approach will lead to OOM. – pasha701 Dec 31 '19 at 14:30
  • @pasha701 check [this](https://stackoverflow.com/questions/41313488/avoid-performance-impact-of-a-single-partition-mode-in-spark-window-functions). – blackbishop Dec 31 '19 at 14:46
  • I think this works, but it is no better than collecting the data; without partitioning, this will move all the data to 1 machine... – Raphael Roth Dec 31 '19 at 17:24