```python
from pyspark import SparkContext
sc = SparkContext()
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.5.0 pyspark-shell'

ds = sqlContext.read.format('com.databricks.spark.xml').option('rowTag', 'row').load('src/main/resources/Tags.xml')

ds.show()
```

I put the above code into a Jupyter cell, and it seems like it doesn't load the `com.databricks.spark.xml` package at all. What am I supposed to do to be able to load an XML file in Jupyter combined with PySpark? I am using Manjaro.

The error is:

```
Py4JJavaError: An error occurred while calling o24.load.
: java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml. Please find packages at http://spark.apache.org/third-party-projects.html
```
ThatKidMike
  • In particular - ["These properties can also be set dynamically in your code **before SparkContext / SparkSession and corresponding JVM have been started**"](https://stackoverflow.com/a/33908466/10938362). In your case you start the contexts first - that's not going to work. – user10938362 May 19 '19 at 20:29
  • Thank you! It feels like I'm moving on with this, but not quite, at least not yet. There is another error I've encountered and I'm not sure what to do with it: "Py4JJavaError: An error occurred while calling o58.load", and after this some details: `java.lang.BootstrapMethodError: java.lang.NoClassDefFoundError: scala/runtime/java8/JFunction0$mcD$sp` - the error happens when `ds = sqlContext.read.format('xml').option('rowTag', 'row').load('src/main/resources/Tags.xml')` is executed. – ThatKidMike May 20 '19 at 15:12
  • That sounds like a Scala version mismatch. All released Spark versions, excluding 2.4.2, use Scala 2.11, not 2.12. – user10938362 May 20 '19 at 16:35
  • And I'm using Scala 2.11, at least that's what is displayed when I type in `spark-submit --version` - it says "Using Scala version 2.11.12". Spark version is 2.4.1. – ThatKidMike May 20 '19 at 17:17
  • Yet you ask for the Scala 2.12 dependency - `com.databricks:spark-xml_2.12:0.5.0`. 2.11 would be `com.databricks:spark-xml_2.11:0.5.0`. See also [Resolving dependency problems in Apache Spark](https://stackoverflow.com/q/41383460/10938362). – user10938362 May 20 '19 at 19:20 (a combined sketch of both fixes follows this thread)
  • Thanks! I literally noticed that 10 minutes ago, before you posted your reply. Everything works like a charm now. Thank you again! – ThatKidMike May 20 '19 at 19:30
  • @user10938362, nice finding. How do you know which spark-xml version belongs to which Spark/Scala version? Is there any mapping documentation? – deathrace Jul 03 '21 at 13:17
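
Putting the two fixes from the comments together, a minimal corrected sketch of the snippet in the question might look like the following: `PYSPARK_SUBMIT_ARGS` is set before the `SparkContext` (and its JVM) is created, and the package coordinate uses the Scala 2.11 artifact to match the local Spark 2.4.1 build. The row tag, file path, and version numbers are taken from the question and comments, not verified elsewhere.

```python
from os import environ
from pyspark import SparkContext
from pyspark.sql import SQLContext

# Set this *before* the SparkContext is created, so the JVM is started
# with the extra package, and match the Scala version of the local
# Spark build (2.11 for Spark 2.4.1, per the comments above).
environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-xml_2.11:0.5.0 pyspark-shell'
)

sc = SparkContext()
sqlContext = SQLContext(sc)

# 'xml' is the short name registered by spark-xml; the full
# 'com.databricks.spark.xml' format string works as well.
ds = (sqlContext.read
      .format('xml')
      .option('rowTag', 'row')
      .load('src/main/resources/Tags.xml'))

ds.show()
```

On Spark 2.4.x a `SparkSession` would be the more idiomatic entry point, but `SQLContext` is kept here to stay close to the question's code.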

0 Answers