I'm trying to draw a simple random sample with Scala from an existing table containing around 100e6 records.
import org.apache.spark.sql.SaveMode

// Target sample size: 300,000 rows out of ~100e6
val nSamples = 3e5.toInt
// Fraction passed to sample(); each row is kept independently with this probability
val frac = 1e-5

// Bernoulli sample without replacement, then cap at nSamples rows
val table = spark.table("db_name.table_name").sample(false, frac).limit(nSamples)

(table
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("db_name.new_name")
)
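(For scale: with ~100e6 rows, frac = 1e-5 keeps only about 1e3 of them, well below nSamples = 3e5, so the limit() above never actually truncates anything. A rough sizing sketch, where totalRows and oversample are placeholder names of mine and not part of the job above:)

val totalRows = 100e6   // approximate row count of the source table
val oversample = 1.1    // sample() is only approximately sized, so aim a bit high
val matchedFrac = nSamples * oversample / totalRows  // ≈ 3.3e-3, the frac that would target nSamples rows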
But it is taking far too long (~5 h by my estimate).
Useful information:
- I have ~6 workers.
- The table has 11433 partitions (checked with the snippet after this list); I'm not sure whether the partitions-to-workers ratio is reasonable.
- I'm running Spark 2.1.0 with Scala.
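This is how I read the partition count, a minimal spark-shell check assuming the DataFrame's underlying RDD partitioning is the relevant number:

spark.table("db_name.table_name").rdd.getNumPartitions
// res0: Int = 11433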
I have tried:

- Removing the .limit() part.
- Changing frac to 1.0, 0.1, etc. (sketched below).
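Concretely, the variants looked roughly like this (fracTried is a placeholder name of mine for the values I swept; nSamples and frac are defined as above):

// Variant 1: no limit() at the end
(spark.table("db_name.table_name")
  .sample(false, frac)
  .write
  .mode(SaveMode.Overwrite)
  .saveAsTable("db_name.new_name")
)

// Variant 2: sweep the fraction while keeping the limit
for (fracTried <- Seq(1.0, 0.1)) {
  spark.table("db_name.table_name")
    .sample(false, fracTried)
    .limit(nSamples)
    .write
    .mode(SaveMode.Overwrite)
    .saveAsTable("db_name.new_name")
}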
Question: how can I make it faster?