from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.5.0 pyspark-shell'
ds = sqlContext.read.format('com.databricks.spark.xml').option('rowTag', 'row').load('src/main/resources/Tags.xml')
ds.show()
I put the above code into a Jupyter cell, but it seems the 'com.databricks.spark.xml' package is not loaded at all. What am I supposed to do to be able to load an XML file in Jupyter together with PySpark? I am using Manjaro.
The error is:
Py4JJavaError: An error occurred while calling o24.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
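For reference, this is the ordering I understood to be required — setting PYSPARK_SUBMIT_ARGS before the SparkContext (and thus the JVM) is created, since changing the variable afterwards has no effect. This is only a sketch; it assumes the spark-xml 0.5.0 artifact for Scala 2.12 matches my local Spark build, and the path 'src/main/resources/Tags.xml' is just my example file:

```python
from os import environ

# Must be exported before SparkContext() starts the JVM;
# note '--packages' appears exactly once.
environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-xml_2.12:0.5.0 pyspark-shell'
)

def load_tags(path='src/main/resources/Tags.xml'):
    # pyspark is imported only after PYSPARK_SUBMIT_ARGS is in place.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    sc = SparkContext()
    sqlContext = SQLContext(sc)
    return (sqlContext.read.format('com.databricks.spark.xml')
            .option('rowTag', 'row')
            .load(path))
```

Even with this ordering I am not sure the Jupyter kernel picks the variable up, which is why I am asking.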