
I have a DataFrame with several thousand records, and I'd like to randomly select 1000 rows into another DataFrame for a demo. How can I do this in Java?

Thank you!

lte__
  • Have you already tried to use a HiveQL query with Spark SQL? – Umberto Griffo Sep 06 '16 at 09:56
  • Yes, but I don't see the relevance. – lte__ Sep 06 '16 at 10:05
  • @Umberto Remember that the question is about getting n random rows, not the first n rows. The author of this question can implement their own sampling or use one of the possibilities implemented in Spark – T. Gawęda Sep 06 '16 at 10:06
  • @T.Gawęda I know, but with HiveQL (Spark SQL is designed to be compatible with Hive) you can create a select statement that randomly selects n rows in an efficient way, and you can use that. Why not? It's another way – Umberto Griffo Sep 06 '16 at 10:14
  • @Umberto Can you post such code? It sounds good! But remember that LIMIT doesn't return random results, see http://stackoverflow.com/questions/23802115/is-limit-clause-in-hive-really-random – T. Gawęda Sep 06 '16 at 10:21
  • I'm okay with using Hive as long as it solves the problem :) Right now I've found that `recent_orders = recent_orders.sample(true, 0.5).limit(1000);` is supposed to do the trick, but I'm open to better solutions! – lte__ Sep 06 '16 at 15:14
  • It's better to calculate the fraction if you want to get a specified number of rows - just to not waste resources :) I've searched for better solutions, however currently I don't see any. Maybe it would be good to raise a JIRA ticket for such functionality – T. Gawęda Sep 07 '16 at 08:19

4 Answers


You can shuffle the rows and then take the top ones (the snippet below is Scala, but the same DataFrame API is available in Java):

import org.apache.spark.sql.functions.rand

dataset.orderBy(rand()).limit(n)
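
Since the question asks for Java, here is a minimal sketch of the same approach with the Java Dataset API, assuming `dataset` is an existing Dataset<Row> and `n` is the desired sample size:

import static org.apache.spark.sql.functions.rand;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Shuffle by a random column, then keep the first n rows.
// `dataset` and `n` are assumed to exist in the surrounding code.
Dataset<Row> randomSample = dataset.orderBy(rand()).limit(n);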
apatry
  • is this implementation efficient? What is the cost of the orderBy? – Hasson Jul 12 '18 at 23:37
  • very simple but highly inefficient. If you're happy to have a rough number of rows, better to use a filter vs. a fraction, rather than populating and sorting an entire random vector to get the `n` smallest values – MichaelChirico May 21 '19 at 11:50
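
The filter-based idea from the comment above could look roughly like this in Java; it is a sketch rather than an exact-count solution, and `datasetCount` is an assumed (known or estimated) row count:

import static org.apache.spark.sql.functions.rand;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Keep each row with probability `fraction` instead of sorting everything.
// Returns roughly fraction * datasetCount rows, not exactly 1000.
double fraction = 1000.0 / datasetCount;  // datasetCount is assumed to be known or estimated
Dataset<Row> roughSample = dataset.filter(rand().lt(fraction));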

You can try the sample() method. Unfortunately, you must give it a fraction rather than a number of rows. You can write a function like this:

def getRandom(dataset: Dataset[_], n: Int) = {
    // sample a fraction large enough to cover n rows, then trim to exactly n
    val count = dataset.count()
    val howManyTake = if (count > n) n else count
    dataset.sample(false, 1.0 * howManyTake / count).limit(n)
}

Explanation: we must take a fraction of the data. If we have 2000 rows and want to get 100 rows, we must have 0.05 of the total rows. If you want to get more rows than there are in the DataFrame, you must use 1.0. The limit() function is invoked to make sure that the rounding is fine and that you don't get more rows than you specified.
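
Since the OP asked for Java, a rough Java translation of the helper above might look like this; it is a sketch under the same fraction-plus-limit logic, not a drop-in library API:

import org.apache.spark.sql.Dataset;

// Sample a fraction large enough to cover n rows, then trim with limit().
static <T> Dataset<T> getRandom(Dataset<T> dataset, int n) {
    long count = dataset.count();          // one full pass over the data
    if (count == 0) {
        return dataset;                    // nothing to sample
    }
    long howManyTake = Math.min(count, n);
    double fraction = (double) howManyTake / count;
    return dataset.sample(false, fraction).limit((int) howManyTake);
}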

Edit: I see the takeSample method mentioned in another answer. But remember:

  1. It's a method of RDD, not Dataset, so you must do something like dataset.rdd.takeSample(false, 1000, System.currentTimeMillis()) and then convert the result back to a DataFrame (see the Java sketch after this list). takeSample will collect all values.
  2. Remember that if you want to get very many rows, you will run into OutOfMemoryError, as takeSample collects its results in the driver. Use it carefully.
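
A hedged Java sketch of that takeSample route, assuming `spark` is an existing SparkSession and `dataset` an existing Dataset<Row>; note that the sampled rows are collected to the driver:

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// takeSample pulls the sampled rows into the driver, so keep n small.
List<Row> rows = dataset.toJavaRDD()
        .takeSample(false, 1000, System.currentTimeMillis());

// Rebuild a DataFrame from the collected rows using the original schema.
Dataset<Row> sampled = spark.createDataFrame(rows, dataset.schema());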
T. Gawęda
  • Is there a way to do it without counting the DataFrame, as this operation will be too expensive for a large DF? – Hasson Dec 31 '17 at 21:18
  • @Hasson Try to cache the DataFrame, so the second action will be much faster. Or you can also use the approxQuantile function; it will be faster but less precise – T. Gawęda Jan 05 '18 at 15:19
  • Giving some margin may help. `df.sample(math.min(1.0, 1.1 * howManyTake / count)).limit(n)` – Hyunjun Kim Feb 07 '22 at 03:05

In PySpark >= 3.1, try this:

sdf.sample(fraction=1.0).limit(n)
s510

I would prefer this in PySpark:

df.sample(withReplacement=False, fraction=desired_fraction)

Here is the doc.
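
For the Java API the question asked about, the equivalent call would be roughly the following sketch, where `desiredFraction` is an assumed sampling fraction in (0, 1]:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Same call from Java; desiredFraction is an assumed sampling fraction.
Dataset<Row> demoDf = df.sample(false, desiredFraction);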

dheeraj .A