Questions tagged [apache-spark-xml]
81 questions
8
votes
2 answers
Why is repartition faster than partitionBy in Spark?
I am attempting to use Spark for a very simple use case: given a large set of files (90k) with device time-series data for millions of devices, group all of the time-series reads for a given device into a single set of files (partition). For now…

Robin Zimmerman
- 593
- 1
- 6
- 17
8
votes
3 answers
Read XML in spark
I am trying to read XML/nested XML in PySpark using the spark-xml jar.
df = sqlContext.read \
    .format("com.databricks.spark.xml") \
    .option("rowTag", "hierachy") \
    .load("test.xml")
When I execute this, the data frame is not created properly.
…

LUZO
- 1,019
- 4
- 19
- 42
8
votes
3 answers
Out of Memory Error when Reading large file in Spark 2.1.0
I want to use Spark to read a large (51GB) XML file (on an external HDD) into a dataframe (using the spark-xml plugin), do some simple mapping and filtering, reorder it, and then write it back to disk as a CSV file.
But I always get a…

Felipe
- 11,557
- 7
- 56
- 103
5
votes
1 answer
How to Generate a complex XML using Spark-Xml
I am trying to generate a complex XML from my JavaRDD<Book> and JavaRDD<Reviews>. How can I combine these two to generate the XML below?
test
…

Punith Raj
- 2,164
- 3
- 27
- 45
5
votes
2 answers
How to parse a dataframe containing xml strings?
How do I parse an XML file that contains XML data within one of its columns?
In one of our projects, we receive XML files in which some of the columns store another XML document. While loading this data into a dataframe, the inner XML is getting converted to…

Gourav Dutta
- 533
- 4
- 10
4
votes
1 answer
Why does Spark-XML on AWS Glue fail with AbstractMethodError?
I have an AWS Glue job written in Python that pulls in the spark-xml library (through the Dependent jars path). I'm using spark-xml_2.11-0.2.0.jar. When I try to output my DataFrame to XML I get an error. The code I'm using…

user472292
- 1,069
- 2
- 22
- 37
3
votes
2 answers
Spark-XML sort Dataframe schema by default
I'm trying to read an SAP ABAP XML via Spark using the Databricks 'Spark-XML' jar.
My problem is that the output dataframe schema is sorted alphabetically by default; I want to maintain the XML schema order.
XML file:

Cir02
- 107
- 7
3
votes
1 answer
How can I expand an Array in a Dataframe in Scala/Spark
I used the Databricks spark-xml package to read an XML file into Spark. The file has the following data structure:
Thriller
2000-10-01
2020-10-01
…

JanusJato
- 33
- 1
- 4
2
votes
1 answer
How to write Pyspark DataFrame to XML Format?
I'm working on a Glue ETL Job that basically reads a dataframe in Pyspark and should output data in XML Format.
I've searched a lot for the solution and the code fails at the particular write statement shown…

Data_Manipulator_07
- 55
- 7
2
votes
1 answer
How to ignore comments while reading an XML file in Pyspark Databricks?
I am trying to read an XML file in an Azure Databricks notebook in PySpark.
The problem is that my persons.xml has some comments at the beginning.
I just want to ignore them while reading the file.
df = spark.read
…

Naman Sinha
- 72
- 5
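
For comparison with the comment question above, here is a minimal sketch in plain Python rather than Spark. It does not show spark-xml behaviour; it only illustrates that the standard-library `xml.etree.ElementTree` parser drops comments by default, so leading comments do not reach the parsed tree. The sample document and the `person_names` helper are hypothetical, since the asker's persons.xml is not shown in the excerpt.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for persons.xml; the real file is not shown in the excerpt.
SAMPLE = """<?xml version="1.0"?>
<!-- generated export -->
<!-- second leading comment -->
<persons>
  <person><name>Ann</name></person>
  <!-- inline comment -->
  <person><name>Bob</name></person>
</persons>"""


def person_names(xml_text):
    """Parse the document and return the <name> text of every <person>."""
    # The default parser discards comments, so they never appear as children.
    root = ET.fromstring(xml_text)
    return [person.findtext("name") for person in root.findall("person")]


names = person_names(SAMPLE)
child_count = len(ET.fromstring(SAMPLE))  # counts only element children
print(names, child_count)
```

Whether the Spark reader behaves the same way depends on the spark-xml version and options in use, which this sketch does not cover.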
2
votes
0 answers
Spark-xml vs SAX Parser vs DOM parser, which one is better?
I'm exploring XML processing in different technologies. We already have some code: in Java we used a SAX parser, and in Spark we're using spark-xml from Databricks. Now I'm trying to find out the pros and cons of each parser under certain…

DINGJOY
- 51
- 5
2
votes
0 answers
Spark to read BlobStorage files "java.io.IOException: No FileSystem for scheme: https"
Currently, I'm using the azure-storage-blob and hadoop-azure packages for downloading files from a Blob Storage to local.
...
String url = "https://blob_storage_url";
String filename = url.replaceFirst("https.*/", "");
// Setup the cloud storage…

JRH
- 21
- 1
2
votes
1 answer
Databricks Spark CREATE TABLE takes forever for 1 million small XML files
I have a set of 1 million XML files, each of size ~14KB in Azure Blob Storage, mounted in Azure Databricks, and I am trying to use CREATE TABLE, with the expectation of one record for each file.
The Experiment
The content structure of the files is…

Abhra Basak
- 382
- 4
- 13
2
votes
0 answers
Spark XML API - Text between tags
Using Spark XML I'm trying to get to the text that appears between 2 elements within a root element. For example:
x cannot be seen y neither
I would like to get to the text between b elements (text cannot be seen and text…

APM
- 21
- 2
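
The "text between tags" case in the last question is usually called mixed content. A minimal sketch in plain Python, not spark-xml, shows where such text lives in a parsed tree. The sample markup is an assumed reconstruction, because the excerpt lost its tags, and `text_between` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the excerpt's sample; the original markup was not preserved.
SAMPLE = "<a><b>x</b> cannot be seen <b>y</b> neither</a>"


def text_between(root, tag):
    """Return the text that follows each <tag> child (its 'tail'), stripped."""
    # ElementTree stores text after an element's closing tag in .tail,
    # not in the parent's .text, which only holds text before the first child.
    return [(child.tail or "").strip() for child in root.findall(tag)]


root = ET.fromstring(SAMPLE)
tails = text_between(root, "b")
print(tails)
```

This only demonstrates the data model. Whether a Spark XML reader exposes the same text depends on how it maps mixed content to columns.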