I have a set of 1 million XML files, each ~14 KB in size, in Azure Blob Storage mounted in Azure Databricks, and I am trying to load them with CREATE TABLE, expecting one record per file.
The Experiment
The content structure of the files is shown below. For simplicity and for the performance experiment, all content of the files except the <ID> element is kept identical.
<OBSERVATION>
<HEADER>...</HEADER>
<RESULT>
<ID>...</ID>
<VALUES>...</VALUES>
</RESULT>
</OBSERVATION>
For parsing/deserialization I am using spark-xml by Databricks. At this point, I expect records with two columns, HEADER and RESULT, which is what I am getting.
CREATE TABLE Observations
USING XML
OPTIONS (
path "/mnt/blobstorage/records/*.xml",
rowTag "RESULT",
rootTag "OBSERVATION",
excludeAttribute "true"
)
The Problem
The CREATE TABLE statement runs for 5.5 hours (it appears as a SQL query named sql at SQLDriverLocal.scala:87 in the Spark UI), of which only 1 hour is spent in Spark jobs (in the Jobs tab of the Spark UI).
I have noticed that the cell with the CREATE TABLE command remains stuck at Listing files at "/mnt/blobstorage/records/*.xml" for most of that time. At first I thought it was a scaling problem in the storage connector. However, I can run the same command on ~500K JSON files of similar size in ~25 s (is this a problem with XML vs. JSON?).
I also know that spark-xml reads all the files to infer the schema, which might be the bottleneck. To rule this out, I tried to:
- predefine a schema (derived from only the first XML file)
- ingest as plain text without parsing (using the TEXT provider)

The same problem persists in both cases.
The same statement runs in ~20 s for 10K records, and in ~30 min for 200K records. With linear scaling (which is obviously not happening), 1 million records would have finished in ~33 minutes.
My Databricks cluster has 1 driver node and 3 worker nodes, each with 256 GB of RAM and 64 cores, so there should not be a caching bottleneck. I have reproduced the issue consistently in multiple runs over 4 days.
The Question
What am I doing wrong here? If there is some partitioning/clustering I can do during the CREATE TABLE, how do I do it?