
I have a set of 1 million XML files, each ~14 KB in size, in Azure Blob Storage mounted in Azure Databricks, and I am trying to use CREATE TABLE, expecting one record per file.

The Experiment

The content structure of the files is depicted below. For simplicity and performance experimentation, all content of the files except the <ID> element is kept identical.

<OBSERVATION>
  <HEADER>...</HEADER>
  <RESULT>
    <ID>...</ID>
    <VALUES>...</VALUES>
  </RESULT>
</OBSERVATION>

For parsing/deserialization, I am using spark-xml by Databricks. At this point, I am expecting records with two columns, HEADER and RESULT, which is what I am getting.

CREATE TABLE Observations
USING XML
OPTIONS (
  path "/mnt/blobstorage/records/*.xml",
  rowTag "RESULT",
  rootTag "OBSERVATION",
  excludeAttribute True
)

The Problem

The CREATE TABLE statement runs for 5.5 hours (a SQL query named sql at SQLDriverLocal.scala:87 in the Spark UI), of which only 1 hour is spent in Spark jobs (in the Jobs tab of the Spark UI).

I have noticed that the cell with the CREATE TABLE command remains stuck at Listing files at "/mnt/blobstorage/records/*.xml" for most of the time. At first I thought it was a scaling problem in the storage connector. However, I can run the same command on ~500K JSON files of similar size in ~25 s (a problem with XML vs. JSON?).

I also know that spark-xml reads all the files to infer the schema, which might be the bottleneck. To eliminate this possibility, I tried to:

  • predefine a schema (derived from only the first XML file)
  • ingest the files as plain text without parsing (using the TEXT provider)

The same problem persists in both cases; a rough sketch of both attempts is below.
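
Roughly, this is what the two attempts looked like in PySpark (paths are simplified, and the single-file name below is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available in a Databricks notebook

# 1) Derive the schema from a single file and reuse it for the full glob,
#    so spark-xml does not have to infer it from all 1 million files.
schema_from_one = (spark.read.format("xml")
                   .option("rowTag", "RESULT")
                   .load("/mnt/blobstorage/records/record_000001.xml")  # placeholder file name
                   .schema)

df = (spark.read.format("xml")
      .option("rowTag", "RESULT")
      .schema(schema_from_one)
      .load("/mnt/blobstorage/records/*.xml"))

# 2) Ingest as plain text, without any XML parsing at all
raw = spark.read.text("/mnt/blobstorage/records/*.xml")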

The same statement runs within 20 s for 10K records, and in ~30 minutes for 200K records. With linear scaling from the 10K case (which is obviously not happening), 1 million records would have been done in about 100 × 20 s ≈ 33 minutes.

My Databricks cluster has 1 driver node and 3 worker nodes, each with 256 GB of RAM and 64 cores, so there should not be a caching bottleneck. I have reproduced the issue consistently in multiple runs over 4 days.

The Question

What am I doing wrong here? If there is some partitioning / clustering I can do during the CREATE TABLE, how do I do it?

Abhra Basak

1 Answer


My guess is that you are running into a small-file problem, as you are processing only ~15 GB in total. I would merge the small files into bigger files of ca. 250 MB each. As your dataset is still small, you could do this on the driver. The following code shows how to do the merge on a driver node (without considering optimal file size):

1. Copy the files from Blob to local file-system and generate a script for file merge:

# copy the files from the mounted storage to the driver's local storage
dbutils.fs.cp("dbfs:/mnt/blobstorage/records/", "file:/databricks/driver/temp/records", recurse=True)

unzipdir = 'temp/records/'
gzipdir  = 'temp/gzip/'

# generate a shell script (create the target directory, concatenate all XML
# files into one, then gzip it) and write it to the local filesystem
script = ("mkdir -p " + gzipdir + "\n" +
          "cat " + unzipdir + "*.xml > " + gzipdir + "all.xml\n" +
          "gzip " + gzipdir + "all.xml")
dbutils.fs.put("file:/databricks/driver/scripts/makeone.sh", script, True)

2. Run the shell script

%sh
sudo sh ./scripts/makeone.sh

3. Copy the files back to the mounted storage

dbutils.fs.mv("file:/databricks/driver/" + gzipdir, "dbfs:/mnt/blobstorage/recordsopt/", recurse=True)

Another important point is that the spark-xml library takes a two-step approach:

  1. It parses the data to infer the schema. If the parameter samplingRatio is not changed, it does this for the whole dataset. Often it is enough to do this for a smaller sample only, or you can predefine the schema (use the parameter schema for this); then you don't need this step at all. A sketch of both options follows this list.
  2. It reads the data.
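
For illustration, a rough sketch of both options (the schema below is just an assumption about your RESULT element, and the path points at the merged files from above; adjust both to your data):

from pyspark.sql.types import StructType, StructField, StringType

# `spark` is the SparkSession provided by the Databricks notebook

# infer the schema from roughly 1% of the data only ...
df_sampled = (spark.read.format("xml")
              .option("rowTag", "RESULT")
              .option("samplingRatio", 0.01)
              .load("dbfs:/mnt/blobstorage/recordsopt/"))

# ... or skip inference completely by passing a predefined schema
result_schema = StructType([
    StructField("ID", StringType(), True),
    StructField("VALUES", StringType(), True),
])

df_typed = (spark.read.format("xml")
            .option("rowTag", "RESULT")
            .schema(result_schema)
            .load("dbfs:/mnt/blobstorage/recordsopt/"))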

Finally, I would recommend storing the data in Parquet, so that the more sophisticated queries run against a column-based format rather than directly against the XML files, and using the spark-xml library only for this preprocessing step.
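
A minimal sketch of that last step (the target path is just an example):

# parse the merged XML once ...
df = (spark.read.format("xml")
      .option("rowTag", "RESULT")
      .load("dbfs:/mnt/blobstorage/recordsopt/"))

# ... persist it as Parquet and run all further queries against the columnar copy
df.write.mode("overwrite").parquet("dbfs:/mnt/blobstorage/observations_parquet/")

observations = spark.read.parquet("dbfs:/mnt/blobstorage/observations_parquet/")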

Hauke Mallow
  • Thanks for the suggestion. I will try this approach of concatenating files and update this answer thread. A couple of observations I made: using a predefined schema (from a dataframe of only the first file) did not help, as "Listing files" appears to be the bottleneck. – Abhra Basak Feb 25 '19 at 04:40
  • Also, any idea why this problem did not appear for the ~500K JSON files? They were also small files, of 6 KB each. Is there anything that spark-xml does differently from spark-json? – Abhra Basak Feb 25 '19 at 04:41
  • Eventually, the concatenation of files worked. I did the concatenation using Azure Batch and did not need to use DBUtils. Also, I ended up using JSON instead of XML, and it just worked. **25s** – Abhra Basak Feb 25 '19 at 13:06