I have a modest-sized XML file (200 MB, bz2-compressed) that I am loading with spark-xml on an AWS EMR cluster with one master and two core nodes, each with 8 CPUs and 32 GB RAM.

import org.apache.spark.sql.SQLContext
import com.databricks.spark.xml._

val sqlContext = new SQLContext(sc)

// Treat each <EXPERIMENT> element as one row of the resulting DataFrame
val experiment = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "EXPERIMENT")
  .load("s3n://bucket/path/meta_experiment_set.xml.bz2")

This load takes quite a while and, from what I can tell, ends up in a single partition. Is it possible to tell Spark to partition the file on load so it makes better use of the compute resources? I know I can repartition after loading.
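
For reference, this is how I am checking the partition count (standard Spark API, nothing specific to spark-xml):

experiment.rdd.getNumPartitions  // reports 1 for this bz2 load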

1 Answer

You can repartition to increase the parallelism:

experiment.repartition(200)

where 200 is whatever number of partitions you want (typically chosen based on the number of executor cores available).

See the documentation for repartition.
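
A minimal sketch of how that could look end to end; the choice of 16 partitions (2 core nodes × 8 cores) is just an illustrative starting point, not something from the question:

// Repartition right after the load, then cache so the shuffle
// is paid only once across subsequent actions
val partitioned = experiment
  .repartition(16) // e.g. roughly one partition per core
  .cache()

partitioned.count() // first action materializes the repartitioned data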

  • Is there a way to read the file in parallel? I am using bz2 compression in the hope that the data can be read in parallel. It seems silly to wait an hour just to read the file and compute the schema if that could be parallelized. – seandavi Feb 15 '18 at 22:31
  • Unlike .gz-compressed files, .bz2 files can be decompressed in parallel; see [this answer](https://stackoverflow.com/a/37447669/9297144). – Xavier Guihot Feb 15 '18 at 22:34
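
On the schema-inference cost raised in the comments: one option (a sketch with hypothetical field names; spark-xml's actual column names depend on the XML structure, e.g. attributes are typically prefixed with an underscore) is to pass an explicit schema so the reader can skip the inference scan entirely:

import org.apache.spark.sql.types._

// Hypothetical schema; replace with the actual fields of <EXPERIMENT>
val experimentSchema = StructType(Seq(
  StructField("_accession", StringType, nullable = true),
  StructField("TITLE", StringType, nullable = true)
))

val experiment = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "EXPERIMENT")
  .schema(experimentSchema) // user-provided schema: no inference pass
  .load("s3n://bucket/path/meta_experiment_set.xml.bz2")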