
I'm trying to import the KMeans and Vectors classes from spark.mllib. The platform is IBM Cloud (DSX) with Python 3.5 and a Jupyter Notebook.

I've tried:

import org.apache.spark.mllib.linalg.Vectors
import apache.spark.mllib.linalg.Vectors
import spark.mllib.linalg.Vectors

I've found several examples/tutorials where the first import works for the author. I was able to confirm that the Spark library itself isn't loaded in the environment. Normally I would download the package and then import it, but being new to VMs, I'm not sure how to make that happen.

I've also tried pip install spark without luck. It throws an error that reads:

The following command must be run outside of the IPython shell:

    $ pip install spark

The Python package manager (pip) can only be used from outside of IPython.
Please reissue the `pip` command in a separate terminal or command prompt.

But this is in a VM where I don't see any way to access the CLI externally.

I did find this, but I don't think I have a mismatch problem; it covers the issue of importing into DSX, but I can't quite interpret it for my situation.

I think this is the actual issue I'm having, but it's for SparkR rather than Python.

Bill Armstrong

2 Answers


It looks like you are trying to use Scala code in a Python notebook.

To get the Spark session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

This will print the version of Spark:

spark.version

To import the ML libraries:

from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.clustering import KMeansModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors

Note: This uses the spark.ml package. The spark.mllib package is the RDD-based library and is currently in maintenance mode. The primary ML library is now spark.ml (DataFrame-based).

https://spark.apache.org/docs/latest/ml-guide.html
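
As a rough sketch of how these pieces fit together, here is a minimal KMeans run with the DataFrame-based API; the toy data and the x and y column names are invented for illustration:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame; the x and y columns are made up for this example.
df = spark.createDataFrame(
    [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)],
    ["x", "y"])

# KMeans expects a single vector column, conventionally named "features".
assembler = VectorAssembler(inputCols=["x", "y"], outputCol="features")
features = assembler.transform(df)

# Fit two clusters and append a "prediction" column to each row.
model = KMeans(k=2, seed=1).fit(features)
model.transform(features).show()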

markwatsonatx
  • This was the push I needed. I discovered that you'll still need to install pyspark with `!pip install --user pyspark` (the ! giving shell access at the Jupyter level). I noticed that both `mllib` and `ml` work, just with different syntax, but I will recode to `ml`. Thanks. – Bill Armstrong Apr 01 '18 at 16:20
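
A minimal sketch of that install step from the comment, assuming the --user flag behaves as usual on this platform (hosted notebooks can vary):

# Run in a notebook cell; the leading ! hands the command to the shell
# instead of the Python interpreter.
!pip install --user pyspark

# Afterwards, confirm the package is importable.
import pyspark
print(pyspark.__version__)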

DSX environments don't have Spark. When you create a new notebook, you have to decide whether it runs in one of the new environments (without Spark) or on the Spark backend.

Roland Weber