I'm trying to load a random sample from a dataset of about 100 million rows stored in S3. Is there an easy way to load a random sample from S3 directly into a PySpark DataFrame?
In pandas this would look like this:
import pandas
df = pandas.read_csv(filename, skiprows=skiplines)
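where skiplines is a precomputed list of row numbers to skip. For a random sample, there is also the callable form of skiprows; a sketch of what I mean (the 1% fraction and the seed are arbitrary placeholders):

import random
random.seed(42)
# skiprows as a callable: row i is skipped whenever it returns True.
# Keep the header (row 0) and roughly 1% of the data rows at random.
df = pandas.read_csv(filename, skiprows=lambda i: i > 0 and random.random() > 0.01)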
Is there an equivalent in PySpark I could use?
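The closest thing I've found is to read everything and then call DataFrame.sample(), roughly like the sketch below (the bucket path, fraction, and seed are placeholders), but as far as I can tell this still scans all 100 million rows before sampling:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read every CSV under the S3 prefix, then keep ~1% of rows at random.
# sample() filters after the read, so the full dataset is still scanned.
df = spark.read.csv("s3://my-bucket/data/*.csv", header=True)
sample_df = df.sample(fraction=0.01, seed=42)

Is there a way to sample during the load itself, the way skiprows lets pandas avoid materializing the skipped rows?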