
I have a DataFrame with several thousand records, and I'd like to randomly select 1000 rows into another DataFrame for a demo. How can I do this in Java?

Thank you!

lte__
  • Have you already tried to use a HiveQL query with Spark SQL? – Umberto Griffo Sep 06 '16 at 09:56
  • Yes, but I don't see the relevance. – lte__ Sep 06 '16 at 10:05
  • @Umberto Remember that the question is about getting n random rows, not the first n rows. The author of this question can implement their own sampling or use one of the possibilities implemented in Spark – T. Gawęda Sep 06 '16 at 10:06
  • @T.Gawęda I know, but with HiveQL (Spark SQL is designed to be compatible with Hive) you can create a select statement that randomly selects n rows in an efficient way, and you can use that. Why not? It's another way – Umberto Griffo Sep 06 '16 at 10:14
  • @Umberto Can you post such code? It sounds good! But remember that LIMIT doesn't return random results, see http://stackoverflow.com/questions/23802115/is-limit-clause-in-hive-really-random – T. Gawęda Sep 06 '16 at 10:21
  • I'm okay with using Hive as long as it solves the problem :) Right now I've found that `recent_orders = recent_orders.sample(true, 0.5).limit(1000);` is supposed to do the trick, but I'm open to better solutions! – lte__ Sep 06 '16 at 15:14
  • It's better to calculate the fraction if you want to get a specified number of rows - just to not waste resources :) I've searched for better solutions, however currently I don't see any. Maybe it would be good to raise a JIRA ticket for such functionality – T. Gawęda Sep 07 '16 at 08:19

4 Answers


You can shuffle the rows and then take the top ones (the snippet below is Scala, but the same DataFrame API is available in Java):

import org.apache.spark.sql.functions.rand

dataset.orderBy(rand()).limit(n)
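
Since the question asks for Java, here is a minimal sketch of the same approach with the Java Dataset API, assuming `dataset` is an existing Dataset<Row> and `n` is the desired sample size:

import static org.apache.spark.sql.functions.rand;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Shuffle by a random column, then keep the first n rows.
// `dataset` and `n` are assumed to exist in the surrounding code.
Dataset<Row> randomSample = dataset.orderBy(rand()).limit(n);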
apatry
  • is this implementation efficient? What is the cost of the orderBy? – Hasson Jul 12 '18 at 23:37
  • very simple but highly inefficient. If you're happy to have a rough number of rows, better to use a filter vs. a fraction, rather than populating and sorting an entire random vector to get the `n` smallest values – MichaelChirico May 21 '19 at 11:50
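
The filter-based idea from the comment above could look roughly like this in Java; it is a sketch rather than an exact-count solution, and `datasetCount` is an assumed (known or estimated) row count:

import static org.apache.spark.sql.functions.rand;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Keep each row with probability `fraction` instead of sorting everything.
// Returns roughly fraction * datasetCount rows, not exactly 1000.
double fraction = 1000.0 / datasetCount;  // datasetCount is assumed to be known or estimated
Dataset<Row> roughSample = dataset.filter(rand().lt(fraction));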

You can try the sample() method. Unfortunately, you must give it a fraction rather than a number of rows. You can write a function like this:

def getRandom(dataset: Dataset[_], n: Int) = {
    // sample a fraction large enough to cover n rows, then trim to exactly n
    val count = dataset.count()
    val howManyTake = if (count > n) n else count
    dataset.sample(false, 1.0 * howManyTake / count).limit(n)
}

Explanation: we must take a fraction of the data. If we have 2000 rows and want to get 100 rows, we must have 0.05 of the total rows. If you want to get more rows than there are in the DataFrame, you must use 1.0. The limit() function is invoked to make sure that the rounding is fine and that you don't get more rows than you specified.
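
Since the OP asked for Java, a rough Java translation of the helper above might look like this; it is a sketch under the same fraction-plus-limit logic, not a drop-in library API:

import org.apache.spark.sql.Dataset;

// Sample a fraction large enough to cover n rows, then trim with limit().
static <T> Dataset<T> getRandom(Dataset<T> dataset, int n) {
    long count = dataset.count();          // one full pass over the data
    if (count == 0) {
        return dataset;                    // nothing to sample
    }
    long howManyTake = Math.min(count, n);
    double fraction = (double) howManyTake / count;
    return dataset.sample(false, fraction).limit((int) howManyTake);
}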

Edit: I see the takeSample method mentioned in another answer. But remember:

  1. It's a method of RDD, not Dataset, so you must do something like dataset.rdd.takeSample(false, 1000, System.currentTimeMillis()) and then convert the result back to a DataFrame (see the Java sketch after this list). takeSample will collect all values.
  2. Remember that if you want to get very many rows, you will run into OutOfMemoryError, as takeSample collects its results in the driver. Use it carefully.
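
A hedged Java sketch of that takeSample route, assuming `spark` is an existing SparkSession and `dataset` an existing Dataset<Row>; note that the sampled rows are collected to the driver:

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// takeSample pulls the sampled rows into the driver, so keep n small.
List<Row> rows = dataset.toJavaRDD()
        .takeSample(false, 1000, System.currentTimeMillis());

// Rebuild a DataFrame from the collected rows using the original schema.
Dataset<Row> sampled = spark.createDataFrame(rows, dataset.schema());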
T. Gawęda
  • Is there a way to do it without counting the DataFrame, as this operation will be too expensive for a large DF? – Hasson Dec 31 '17 at 21:18
  • @Hasson Try to cache the DataFrame, so the second action will be much faster. Or you can also use the approxQuantile function; it will be faster but less precise – T. Gawęda Jan 05 '18 at 15:19
  • Giving some margin may help. `df.sample(math.min(1.0, 1.1 * howManyTake / count)).limit(n)` – Hyunjun Kim Feb 07 '22 at 03:05

In PySpark >= 3.1, try this:

sdf.sample(fraction=1.0).limit(n)
s510

I would prefer this in PySpark:

df.sample(withReplacement=False, fraction=desired_fraction)

Here is the doc.
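
For the Java API the question asked about, the equivalent call would be roughly the following sketch, where `desiredFraction` is an assumed sampling fraction in (0, 1]:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Same call from Java; desiredFraction is an assumed sampling fraction.
Dataset<Row> demoDf = df.sample(false, desiredFraction);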

dheeraj .A