I have a modest-sized XML file (200 MB, bz2-compressed) that I am loading with spark-xml on an AWS EMR cluster with 1 master and 2 core nodes, each with 8 CPUs and 32 GB RAM.
import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml._

val sqlContext = new SQLContext(sc)

// Each <EXPERIMENT> element in the file becomes one row of the DataFrame
val experiment = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "EXPERIMENT")
  .load("s3n://bucket/path/meta_experiment_set.xml.bz2")
This load takes quite a while and, as far as I can tell, is done with only one partition. Is it possible to tell Spark to partition the file on loading so that it makes better use of the compute resources? I know I can repartition after loading.
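For reference, this is roughly what I mean by repartitioning afterwards; a minimal sketch run in the same spark-shell session, where the target of 16 partitions is just an arbitrary choice matching the 2 core nodes x 8 CPUs on this cluster:

// Inspect how many partitions the load actually produced
println(experiment.rdd.getNumPartitions)

// Explicitly repartition after the (single-partition) load has finished;
// 16 is an assumed target, one partition per available core
val repartitioned = experiment.repartition(16)
println(repartitioned.rdd.getNumPartitions)

But this only spreads the data out after the slow single-partition read has already happened, which is what I am trying to avoid.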