In the following two examples, the number of tasks run and the corresponding run time imply that the sampling options have no effect, as they are similar to jobs run without any sampling options:
val df = spark.read.options("samplingRatio",0.001).json("s3a://test/*.json.bz2")
val df = spark.read.option("sampleSize",100).json("s3a://test/*.json.bz2")
I know that explicit schemas are best for performance, but in convenience cases sampling is useful.
New to Spark, am I using these options incorrectly? Attempted the same approach in PySpark, with same results:
df = spark.read.options(samplingRatio=0.1).json("s3a://test/*.json.bz2")
df = spark.read.options(samplingRatio=None).json("s3a://test/*.json.bz2")