
I'm trying to load a random sample from a dataset of roughly 100 million rows stored in S3. Is there an easy way to load a random sample from S3 into a PySpark DataFrame directly?

In pandas this would look something like this:

df = pandas.read_csv(filename, skiprows=skiplines)
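
For example, to get a rough 1% random sample this way, skiprows can also be a callable (just a sketch of what I mean; the lambda is one way to pick rows at random):

import random
import pandas

sample_frac = 0.01  # keep roughly 1% of rows
# skip each data row with probability 1 - sample_frac, but never skip the header (row 0)
df = pandas.read_csv(filename, skiprows=lambda i: i > 0 and random.random() > sample_frac)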

Is there an equivalent in PySpark I could use?

JanBennk
  • See also: https://stackoverflow.com/questions/24806084/sampling-a-large-distributed-data-set-using-pyspark-spark/24809595 – pault Feb 28 '18 at 20:51

1 Answer


I believe that Spark's DataFrameReader.csv is lazy by default [citation needed], i.e. calling it does not pull the data from S3 by itself.

So, you should be able to read the CSV and use pyspark.sql.DataFrame.sample:

frac = 0.01 # get approximately 1%
df = spark.read.csv(filename)  # lazy: no full scan of the data happens here
sample = df.sample(withReplacement=False, fraction=frac)  # still lazy

But nothing actually executes until you apply an action (such as count(), collect(), or a write).
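
For example (just a sketch; the s3a:// paths are placeholders for your bucket), nothing is read from S3 until the count or the write at the end:

frac = 0.01
df = spark.read.csv("s3a://my-bucket/my-data/*.csv")      # lazy: no full scan happens here
sample = df.sample(withReplacement=False, fraction=frac)  # still lazy
sample.cache()                                            # optional: keep the sampled rows around for reuse
print(sample.count())                                     # action: this is when Spark actually scans the files
sample.write.parquet("s3a://my-bucket/my-sample/")        # action: or persist the sample for later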

pault
  • So to understand you correctly: the full dataset in S3 is NOT loaded into memory before sampling? It only loads the randomly selected rows (1% in this case) into memory? – JanBennk Mar 07 '18 at 20:15
  • TBH, I can't say for sure but I _believe_ this is the case. Do you have a limitation? Is it something you can test to see if it works for your case? – pault Mar 07 '18 at 20:24
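
One rough way to check (a sketch; test_path is a placeholder for a small test file): run an action on the sample and look at the query plan and the Spark UI:

df = spark.read.csv(test_path, header=True)
sample = df.sample(withReplacement=False, fraction=0.01)
sample.explain()       # the physical plan should show a Sample on top of the file scan
print(sample.count())  # action: the Spark UI's input metrics show what actually got read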