
I'm trying to load a random sample from a dataset of roughly 100 million rows stored in S3. Is there an easy way to load a random sample from S3 into a PySpark DataFrame directly?

In pandas this would look something like this:

df = pandas.read_csv(filename, skiprows=skiplines)
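
For example, to get a rough 1% random sample this way, skiprows can also be a callable (just a sketch of what I mean; the lambda is one way to pick rows at random):

import random
import pandas

sample_frac = 0.01  # keep roughly 1% of rows
# skip each data row with probability 1 - sample_frac, but never skip the header (row 0)
df = pandas.read_csv(filename, skiprows=lambda i: i > 0 and random.random() > sample_frac)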

Is there an equivalent in PySpark I could use?

JanBennk
  • See also: https://stackoverflow.com/questions/24806084/sampling-a-large-distributed-data-set-using-pyspark-spark/24809595 – pault Feb 28 '18 at 20:51

1 Answer


I believe that Spark's DataFrameReader.csv is lazy by default [citation needed], i.e. calling it does not pull the data from S3 by itself.

So, you should be able to read the CSV and use pyspark.sql.DataFrame.sample:

frac = 0.01 # get approximately 1%
df = spark.read.csv(filename)  # lazy: no full scan of the data happens here
sample = df.sample(withReplacement=False, fraction=frac)  # still lazy

But nothing actually executes until you apply an action (such as count(), collect(), or a write).
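
For example (just a sketch; the s3a:// paths are placeholders for your bucket), nothing is read from S3 until the count or the write at the end:

frac = 0.01
df = spark.read.csv("s3a://my-bucket/my-data/*.csv")      # lazy: no full scan happens here
sample = df.sample(withReplacement=False, fraction=frac)  # still lazy
sample.cache()                                            # optional: keep the sampled rows around for reuse
print(sample.count())                                     # action: this is when Spark actually scans the files
sample.write.parquet("s3a://my-bucket/my-sample/")        # action: or persist the sample for later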

pault
  • So to understand you correctly: the full dataset in S3 is NOT loaded into memory before sampling? It only loads the randomly selected rows (1% in this case) into memory? – JanBennk Mar 07 '18 at 20:15
  • TBH, I can't say for sure but I _believe_ this is the case. Do you have a limitation? Is it something you can test to see if it works for your case? – pault Mar 07 '18 at 20:24
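
One rough way to check (a sketch; test_path is a placeholder for a small test file): run an action on the sample and look at the query plan and the Spark UI:

df = spark.read.csv(test_path, header=True)
sample = df.sample(withReplacement=False, fraction=0.01)
sample.explain()       # the physical plan should show a Sample on top of the file scan
print(sample.count())  # action: the Spark UI's input metrics show what actually got read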