My question is based upon this.

  1. Could you add more detailed comments or explain the code starting at the line tf = HashingTF().transform(training_raw.map(lambda doc: doc["text"], preservesPartitioning=True))?
  2. How can I print the confusion matrix?
  3. What does the error below mean, and how can I fix it? The model still gets built and I get predictions.

    >>> # Train and check
    ... model = NaiveBayes.train(training)
    [Stage 2:=============================> (2 + 2) / 4]16/04/05 18:18:28 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    16/04/05 18:18:28 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

  4. How can I print the prediction for a new observation? I tried the following and it failed:

    >>> model.predict("love")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\spark-1.6.1-bin-hadoop2.6\spark-1.6.1-bin-hadoop2.6\python\pyspark\mllib\classification.py", line 594, in predict
        x = _convert_to_vector(x)
      File "c:\spark-1.6.1-bin-hadoop2.6\spark-1.6.1-bin-hadoop2.6\python\pyspark\mllib\linalg\__init__.py", line 77, in _convert_to_vector
        raise TypeError("Cannot convert type %s into Vector" % type(l))
    TypeError: Cannot convert type <class 'str'> into Vector

user2543622

1 Answer

1. HashingTF in Spark is similar to scikit-learn's HashingVectorizer: it hashes each term to a feature index and counts how often each index occurs. training_raw is an RDD of text documents. For a detailed explanation of the available vectorizers in PySpark, see Vectorizers. For a complete example, see this post.
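To make the hashing step concrete, here is a minimal pure-Python sketch of the idea behind a hashing term-frequency transform. It is not Spark's implementation: the function name hashing_tf, the sample record, and the use of zlib.crc32 as the hash are assumptions made for illustration, and Spark uses its own hash function, so the indices will differ. The default of 1 << 20 buckets mirrors the default numFeatures of pyspark.mllib's HashingTF.

```python
from collections import Counter
import zlib


def hashing_tf(tokens, num_features=1 << 20):
    """Map a list of tokens to a sparse term-frequency dict {index: count}."""
    counts = Counter()
    for token in tokens:
        # crc32 is a stand-in hash picked for determinism (an assumption);
        # Spark hashes terms differently, so real indices will not match.
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


# Hypothetical record shaped like the documents in training_raw.
doc = {"text": "i love spark i love python"}
tf = hashing_tf(doc["text"].split(" "))
print(tf)  # sparse representation: only the non-zero buckets are stored
```

Because no vocabulary is stored, the same token always lands in the same bucket, which is why a new document can be vectorized later without refitting anything.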

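2. BLAS is the Basic Linear Algebra Subprograms library. The warnings mean that Spark could not load a native BLAS implementation, so it falls back to a slower pure-JVM one; that is why the model still trains and predicts. You can check out this page on GitHub for a potential solution.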

3. You are trying to use model.predict on a string ("love"). You must first convert the string to a vector. A simple example that parses a line of the form "label,f1 f2 f3" into a LabeledPoint with a dense feature vector is:

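from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint

def parseLine(line):
    # Expected input: "label,f1 f2 f3", e.g. "1.0,0.5 0.0 2.0"
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)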

Since hashed text features are mostly zeros, you are probably looking for a sparse vector, so try Vectors.sparse.
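For the model.predict("love") case specifically, a rough sketch of the conversion step could look like the helper below. The name predict_text is hypothetical. It assumes model is the trained NaiveBayesModel, that hashing_tf is a HashingTF created with the same numFeatures as in training, and that training documents were tokenized by splitting on whitespace; if they were tokenized differently, the same tokenizer should be used here.

```python
def predict_text(model, hashing_tf, text):
    """Vectorize raw text the same way as the training data, then predict.

    model      -- object with a predict(vector) method (assumed: NaiveBayesModel)
    hashing_tf -- object with a transform(tokens) method (assumed: HashingTF)
    """
    # Assumption: whitespace tokenization matches what was used in training.
    tokens = text.split()
    # HashingTF.transform accepts a list of terms and returns a sparse vector.
    features = hashing_tf.transform(tokens)
    return model.predict(features)


# Intended use inside the PySpark shell (not executed here):
#   from pyspark.mllib.feature import HashingTF
#   print(predict_text(model, HashingTF(), "love"))
```

Passing the transformer in as an argument keeps the helper independent of how the features were built, as long as training and prediction share the same settings.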

goCards
For 2, I understand what BLAS stands for now. But could you provide tips to get rid of the error? Also, please let me know how to print the confusion matrix. Thanks. – user2543622 Apr 06 '16 at 02:26