
I need to use com.databricks.spark.xml from a Google Cloud notebook.

I tried:

import os
#os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell'
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'

articles_df = spark.read.format('xml'). \
    options(rootTag='articles', rowTag='article'). \
    load('gs://....-20180831.xml', schema=articles_schema)

but I'm getting:

java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.xml
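
For completeness, my understanding is that PYSPARK_SUBMIT_ARGS only takes effect if it is set before the first SparkSession is created in the kernel; a minimal sketch of that ordering (the package version and the tiny placeholder schema are just assumptions):

import os
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

# Set the packages argument *before* the first SparkSession/SparkContext
# is created in this kernel; afterwards it has no effect.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-xml_2.11:0.6.0 pyspark-shell'
)

spark = SparkSession.builder.appName('spark-xml-test').getOrCreate()

# Placeholder schema; the real articles_schema is defined elsewhere.
articles_schema = StructType([StructField('title', StringType(), True)])

articles_df = (spark.read.format('xml')
    .options(rootTag='articles', rowTag='article')
    .load('gs://....-20180831.xml', schema=articles_schema))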

  • This could be an answer: https://stackoverflow.com/questions/33908156/how-to-load-jar-dependenices-in-ipython-notebook/33908466#33908466 – blackbishop Dec 28 '19 at 21:27
  • In case someone needs it: I had to add the spark-xml.jar to the "jars" folder of pyspark. If you need to run it on Dataproc, you have to specify the package in the properties when creating the cluster from the CLI: "--properties spark:spark.jars.packages=com.databricks:spark-xml_2.11:0.6.0" – zbeedatm Jan 03 '20 at 14:01
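
For reference, a minimal sketch of the same idea as a session-level config, equivalent to the Dataproc cluster property in the comment above (package coordinates taken from that comment; this too only works if set before the session is first created):

from pyspark.sql import SparkSession

# Session-level equivalent of the Dataproc cluster property
# spark:spark.jars.packages mentioned in the comment above.
spark = (SparkSession.builder
    .appName('spark-xml-example')
    .config('spark.jars.packages', 'com.databricks:spark-xml_2.11:0.6.0')
    .getOrCreate())

articles_df = (spark.read.format('xml')
    .options(rootTag='articles', rowTag='article')
    .load('gs://....-20180831.xml'))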

0 Answers