
I have stored XML files in an S3 bucket and want to read them on EMR. After running:

sqlContext.read.format("com.databricks.spark.xml").option("rowTag", "Profile").load(xml_file_path)

I got this error:

An error occurred while calling o445.load. : java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html

Pierre
Kevin Wu
  • You need to specify the path to the external jar, e.g. `pyspark --jars spark-xml_2.11-0.6.0.jar`. For more detailed answers, have a look at: https://stackoverflow.com/questions/27698111/how-to-add-third-party-java-jars-for-use-in-pyspark – Waqas Aug 12 '19 at 04:02

1 Answer

  1. Install the spark-xml library on your running EMR cluster with Spark.

  2. Launch a PySpark notebook

  3. Execute the following:

df = spark.read.format('com.databricks.spark.xml').options(rootTag='objects').options(rowTag='object').load("s3://bucket-name/sample.xml")
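If installing the jar on the cluster by hand is inconvenient, another option is to let Spark fetch the package from Maven through the `spark.jars.packages` setting. The sketch below is only an illustration under a few assumptions: the helper names `spark_xml_coordinate`, `build_session` and `read_xml` are made up for this example, the Scala version `2.11` and package version `0.6.0` are taken from the jar name in the comment above and must match your cluster's Spark build, and the setting only takes effect if no Spark session (JVM) is already running in the process.

```python
def spark_xml_coordinate(scala_version="2.11", package_version="0.6.0"):
    """Build the Maven coordinate for spark-xml.

    The defaults mirror the jar named in the comment above
    (spark-xml_2.11-0.6.0.jar); they are assumptions, so adjust them
    to the Scala version your Spark build uses.
    """
    return "com.databricks:spark-xml_{}:{}".format(scala_version, package_version)


def build_session(app_name="read-xml"):
    """Create a SparkSession that asks Spark to download spark-xml.

    pyspark is imported lazily so this module can be loaded on a
    machine without Spark installed.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        # Has no effect if a session was already started without it.
        .config("spark.jars.packages", spark_xml_coordinate())
        .getOrCreate()
    )


def read_xml(spark, path, row_tag):
    """Load an XML file into a DataFrame, one row per <row_tag> element."""
    return (
        spark.read.format("com.databricks.spark.xml")
        .option("rowTag", row_tag)
        .load(path)
    )
```

With these helpers, `read_xml(build_session(), "s3://bucket-name/sample.xml", "Profile")` would correspond to the call in the question. The same coordinate can also be passed on the command line as `pyspark --packages com.databricks:spark-xml_2.11:0.6.0`.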

Community
Pierre