I have a modest-sized XML file (200 MB, bz2-compressed) that I am loading with spark-xml on an AWS EMR cluster with one master and two core nodes, each with 8 CPUs and 32 GB RAM.

import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml._

val sqlContext = new SQLContext(sc)

// Treat each <EXPERIMENT> element as one row of the resulting DataFrame
val experiment = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "EXPERIMENT")
  .load("s3n://bucket/path/meta_experiment_set.xml.bz2")

This load takes quite a while and, from what I can tell, ends up in a single partition. Is it possible to tell Spark to partition the file on load so it makes better use of the compute resources? I know I can repartition after loading.
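
For reference, this is how I am checking the partition count (standard Spark API, nothing specific to spark-xml):

experiment.rdd.getNumPartitions  // reports 1 for this bz2 load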

1 Answer

You can repartition to increase the parallelism:

experiment.repartition(200)

where 200 is whatever number of partitions you want (typically chosen based on the number of executor cores available).

See the documentation for repartition.
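
A minimal sketch of how that could look end to end; the choice of 16 partitions (2 core nodes × 8 cores) is just an illustrative starting point, not something from the question:

// Repartition right after the load, then cache so the shuffle
// is paid only once across subsequent actions
val partitioned = experiment
  .repartition(16) // e.g. roughly one partition per core
  .cache()

partitioned.count() // first action materializes the repartitioned data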

  • Is there a way to read the file in parallel? I am using bz2 compression in the hope that the data can be read in parallel. It seems silly to wait an hour just to read the file and compute the schema if that could be parallelized. – seandavi Feb 15 '18 at 22:31
  • Unlike .gz-compressed files, .bz2 files can be decompressed in parallel; see [this answer](https://stackoverflow.com/a/37447669/9297144). – Xavier Guihot Feb 15 '18 at 22:34
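
On the schema-inference cost raised in the comments: one option (a sketch with hypothetical field names; spark-xml's actual column names depend on the XML structure, e.g. attributes are typically prefixed with an underscore) is to pass an explicit schema so the reader can skip the inference scan entirely:

import org.apache.spark.sql.types._

// Hypothetical schema; replace with the actual fields of <EXPERIMENT>
val experimentSchema = StructType(Seq(
  StructField("_accession", StringType, nullable = true),
  StructField("TITLE", StringType, nullable = true)
))

val experiment = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "EXPERIMENT")
  .schema(experimentSchema) // user-provided schema: no inference pass
  .load("s3n://bucket/path/meta_experiment_set.xml.bz2")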