How can I read in XML files from S3 bucket on EMR?

Question

I have XML files stored in an S3 bucket and want to read them on EMR. I ran:

sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Profile").load(xml_file_path)

It failed with this error:

An error occurred while calling o445.load. : java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html

You need to specify the path to the external jar, e.g. `pyspark --jars spark-xml_2.11-0.6.0.jar`. For more detailed answers, see https://stackoverflow.com/questions/27698111/how-to-add-third-party-java-jars-for-use-in-pyspark - Waqas, Aug 12 '19 at 04:02
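If you would rather not download the jar by hand, one alternative is to let Spark resolve the package from Maven through the `spark.jars.packages` setting. This is a minimal sketch, assuming the Scala 2.11 build and version 0.6.0 named in the comment above match your cluster. The helper names `spark_xml_coordinate` and `build_session` are made up for illustration.

```python
def spark_xml_coordinate(scala_version="2.11", package_version="0.6.0"):
    """Build the Maven coordinate for spark-xml.

    The defaults mirror the jar name from the comment above
    (spark-xml_2.11-0.6.0.jar); adjust them to your cluster (assumption).
    """
    return "com.databricks:spark-xml_{}:{}".format(scala_version, package_version)


def build_session(app_name="read-xml-from-s3"):
    """Create a SparkSession that asks Spark to fetch spark-xml from Maven."""
    # Imported lazily so the coordinate helper works without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Must be set before the session (and its JVM) is started.
        .config("spark.jars.packages", spark_xml_coordinate())
        .getOrCreate()
    )
```

This only takes effect if no Spark session is already running in the process. From the command line, `pyspark --packages com.databricks:spark-xml_2.11:0.6.0` should have the same effect.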

score 0 · Answer 1 · edited Jun 20 '20 at 09:12


1. Install the spark-xml library on your running EMR cluster with Spark.
2. Launch a PySpark notebook.
3. Execute the following:

df = spark.read.format('com.databricks.spark.xml').options(rootTag='objects').options(rowTag='object').load("s3://bucket-name/sample.xml")
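If you read several files, the call above can be wrapped in a small function. This is only a sketch: `read_xml` is a hypothetical helper name, and it passes just `rowTag`, on the assumption that this is the option that decides which XML element becomes a row when reading.

```python
XML_FORMAT = "com.databricks.spark.xml"


def read_xml(spark, path, row_tag):
    """Load the XML at `path` into a DataFrame, one row per <row_tag> element.

    `spark` is an existing SparkSession with the spark-xml package available.
    """
    return (
        spark.read
        .format(XML_FORMAT)
        .option("rowTag", row_tag)
        .load(path)
    )
```

Usage in the notebook would look like `df = read_xml(spark, "s3://bucket-name/sample.xml", "object")`, followed by `df.printSchema()` to check that the rows were picked up as expected.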



answered Mar 19 '20 at 14:20

Pierre

