
I am trying to use the mmlspark package in PySpark but am not able to import it.

My Jupyter notebook is connected to the cluster, and I have included the package details in my SparkSession as shown below. In the Spark UI for the cluster I can see the jars listed under spark.yarn.dist.jars. But when I import mmlspark inside the notebook, I get a message that the package is not found. Is there something I am missing? Thanks.

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

conf = (SparkConf()
        .setAppName("dataPipeline")
        .set("spark.jars.packages", "Azure:mmlspark:0.13")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .set("spark.dynamicAllocation.enabled", "False")
        .set("spark.executor.memory", "8g")
        .set("spark.driver.memory", "4g"))

spark = (SparkSession.builder
         .master("yarn")
         .config(conf=conf)
         .enableHiveSupport()
         .getOrCreate())
Naveenan
  • `spark.jars.packages` has to be set before the JVM is initialized, and if I am not mistaken `SparkConf()` already starts the JVM. – Alper t. Turker Jul 17 '18 at 18:03
  • Tried the following, still getting an error when I import mmlspark: `os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages Azure:mmlspark:0.13 pyspark-shell"` and `import findspark; findspark.add_packages(["Azure:mmlspark:0.13"])` – Naveenan Jul 17 '18 at 21:57
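
For reference, a minimal sketch of the ordering the first comment describes, assuming a freshly restarted Jupyter kernel with no SparkContext or JVM running yet: the packages option is placed in the environment before anything from pyspark is imported, and only then is the session built. The coordinate Azure:mmlspark:0.13 is taken from the question; whether the cluster can actually download it from the spark-packages repository is an assumption.

import os

# Set the packages option before pyspark is imported, so the JVM that
# is launched when the session is created already knows about the
# coordinate (assumption: no SparkContext exists yet in this kernel).
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages Azure:mmlspark:0.13 pyspark-shell"

# Import pyspark only after the environment variable is in place.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .appName("dataPipeline")
         .enableHiveSupport()
         .getOrCreate())

import mmlspark  # expected to resolve once the package has been downloaded to the driver

Restarting the kernel before running this matters because, as the first comment notes, spark.jars.packages is ignored once a JVM is already running.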

0 Answers