Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark

2241 questions
57 votes • 2 answers

What is the difference between Apache Mahout and Apache Spark's MLlib?

Consider a MySQL products database with 10 million products for an e-commerce website. I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop. I wanted to use Mahout over…
eliasah • 39,588
53 votes • 2 answers

What's the difference between Spark ML and MLLIB packages

I noticed there are two LinearRegressionModel classes in Spark, one in the ML package (spark.ml) and another in the MLlib package (spark.mllib). These two are implemented quite differently, e.g. the one from MLlib implements Serializable, while the…
vyakhir • 1,714
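For reference, a minimal sketch of the two parallel APIs, assuming Spark 2.x and an active SparkSession (the column names are illustrative):

    # DataFrame-based API (spark.ml): works on DataFrame columns
    from pyspark.ml.regression import LinearRegression
    lr = LinearRegression(featuresCol="features", labelCol="label")

    # RDD-based API (spark.mllib): works on RDD[LabeledPoint]
    from pyspark.mllib.regression import LinearRegressionWithSGD
    # model = LinearRegressionWithSGD.train(labeled_point_rdd)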
51 votes • 2 answers

AttributeError: 'DataFrame' object has no attribute 'map'

I wanted to convert the Spark data frame to an RDD using the code below: from pyspark.mllib.clustering import KMeans spark_df = sqlContext.createDataFrame(pandas_df) rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data])) model =…
Edamame • 23,718
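In Spark 2.x a DataFrame no longer exposes map directly; the usual fix is to drop to the underlying RDD first. A minimal sketch, assuming a SparkSession named spark and a numeric pandas DataFrame pandas_df:

    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    spark_df = spark.createDataFrame(pandas_df)
    # DataFrame has no .map in Spark 2.x; go through .rdd instead
    rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
    model = KMeans.train(rdd, k=2, maxIterations=10)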
50 votes • 5 answers

How to assign unique contiguous numbers to elements in a Spark RDD

I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm. The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs. Right now, I get the distinct users and SKUs, then…
Dilum Ranatunga • 13,254
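One common answer is zipWithIndex on the distinct keys, which assigns each string a unique contiguous Long starting at 0. A sketch, assuming ratings is an RDD of (username, sku, rating) triples:

    # Build string -> contiguous id lookups (collected to the driver here;
    # for very large key sets, broadcast the maps or use joins instead)
    user_ids = ratings.map(lambda t: t[0]).distinct().zipWithIndex().collectAsMap()
    sku_ids = ratings.map(lambda t: t[1]).distinct().zipWithIndex().collectAsMap()
    numeric = ratings.map(lambda t: (user_ids[t[0]], sku_ids[t[1]], float(t[2])))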
49 votes • 5 answers

How to handle categorical features with spark-ml?

How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the…
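In spark.ml the standard answer is StringIndexer, optionally followed by OneHotEncoder, inside a Pipeline. A minimal sketch, assuming a DataFrame df with a string column named category:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
    assembler = VectorAssembler(inputCols=["categoryVec"], outputCol="features")
    features_df = Pipeline(stages=[indexer, encoder, assembler]).fit(df).transform(df)

Tree-based classifiers such as RandomForestClassifier can usually consume the indexed column directly, without one-hot encoding.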
47 votes • 2 answers

Optimal way to create an ML pipeline in Apache Spark for a dataset with a high number of columns

I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline, consisting of some Transformers and a Classifier. Let's assume for the sake of simplicity that the Pipeline I am working with consists of a…
aMKa • 605
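For wide datasets the usual shape is a single VectorAssembler over all feature columns rather than one stage per column. A sketch, assuming a DataFrame df where every column except label is a numeric feature:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    feature_cols = [c for c in df.columns if c != "label"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    clf = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, clf]).fit(df)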
46 votes • 4 answers

How to serve a Spark MLlib model?

I'm evaluating tools for production ML-based applications and one of our options is Spark MLlib, but I have some questions about how to serve a model once it's trained. For example, in Azure ML, once trained, the model is exposed as a web service…
Luis Leal • 3,388
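Spark ships no serving layer of its own; one common baseline is persisting the fitted PipelineModel and reloading it behind a service you write yourself. A sketch of the persistence half (the path is illustrative):

    from pyspark.ml import PipelineModel

    model.save("hdfs:///models/classifier_v1")
    loaded = PipelineModel.load("hdfs:///models/classifier_v1")
    predictions = loaded.transform(new_data_df)  # scoring still requires a SparkSession

For low-latency single-record scoring, people typically export the model (e.g. via PMML or by re-implementing the scoring function) rather than keeping Spark in the request path.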
45 votes • 1 answer

Calling Java/Scala function from a task

Background: My original question here was "Why does using DecisionTreeModel.predict inside a map function raise an exception?" and it is related to "How to generate tuples of (original label, predicted label) on Spark with MLlib?" When we use the Scala API a…
zero323 • 322,348
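The root cause is that JVM-backed objects reached through the Py4J gateway only exist on the driver, so they cannot be called inside a worker-side closure. For MLlib models the documented workaround is to predict over a whole RDD from the driver and zip the results back:

    # Fails: model.predict(lp.features) inside map() runs on workers,
    # where the Py4J gateway to the JVM model does not exist.
    # Works: predict on the entire features RDD from the driver.
    predictions = model.predict(data.map(lambda lp: lp.features))
    labels_and_preds = data.map(lambda lp: lp.label).zip(predictions)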
43 votes • 1 answer

Out-of-core processing of sparse CSR arrays

How can one apply some function in parallel on chunks of a sparse CSR array saved on disk using Python? Sequentially this could be done, e.g., by saving the CSR array with joblib.dump, opening it with joblib.load(.., mmap_mode="r"), and processing the…
rth • 10,680
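A sketch of the recipe described above, extended with joblib.Parallel over row chunks; it assumes joblib memory-maps the CSR's underlying data/indices/indptr arrays on load, and process stands in for any function of a CSR slice:

    import joblib

    joblib.dump(X_csr, "X.joblib")              # X_csr: scipy.sparse.csr_matrix
    X = joblib.load("X.joblib", mmap_mode="r")  # internal arrays are memmapped

    def process(chunk):
        return chunk.sum()  # placeholder computation

    step = 10000
    results = joblib.Parallel(n_jobs=4)(
        joblib.delayed(process)(X[i:i + step]) for i in range(0, X.shape[0], step))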
39 votes • 3 answers

Column name with dot in Spark

I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their names, as in the following dataset: "col0.1","col1.2","col2.3","col3.4" 1,2,3,4 10,12,15,3 1,12,10,5 This is what…
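Backticks are the usual escape for dots in Spark column names, since an unquoted dot is parsed as struct-field access. A minimal sketch:

    from pyspark.sql.functions import col
    from pyspark.mllib.linalg import Vectors

    cols = [col("`{}`".format(c)) for c in df.columns]  # quote every name
    rdd = df.select(cols).rdd.map(lambda row: Vectors.dense([float(x) for x in row]))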
38 votes • 8 answers

How to extract model hyper-parameters from spark.ml in PySpark?

I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected: from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import…
Paul • 3,321
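Once fitting is done, the tuned values live on the best model. A sketch, assuming cvModel is a fitted CrossValidatorModel wrapping a LogisticRegression:

    best = cvModel.bestModel
    print(best.extractParamMap())  # shows resolved params on recent PySpark versions
    # On older PySpark versions the Java object had to be queried directly, e.g.:
    # print(best._java_obj.getRegParam(), best._java_obj.getMaxIter())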
36 votes • 3 answers

Dealing with unbalanced datasets in Spark MLlib

I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems…
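SMOTE is not built into MLlib; a common substitute is per-class instance weighting through weightCol, which several spark.ml classifiers accept. A sketch, assuming a DataFrame df with features and a binary label column where 1.0 is the rare class:

    from pyspark.sql import functions as F
    from pyspark.ml.classification import LogisticRegression

    # Weight each class by the other class's frequency
    ratio = df.filter(F.col("label") == 1.0).count() / float(df.count())
    weighted = df.withColumn(
        "weight", F.when(F.col("label") == 1.0, 1.0 - ratio).otherwise(ratio))
    model = LogisticRegression(weightCol="weight").fit(weighted)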
35 votes • 1 answer

The value of "spark.yarn.executor.memoryOverhead" setting?

Should the value of spark.yarn.executor.memoryOverhead in a Spark job with YARN be allocated to the app, or just the max value?
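The overhead is per executor and sits on top of spark.executor.memory; YARN sizes each container to their sum. A sketch of setting it explicitly (the values are illustrative):

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.executor.memory", "4g")
            # default is max(384 MB, 10% of executor memory) in recent Spark;
            # the key was renamed to spark.executor.memoryOverhead in Spark 2.3
            .set("spark.yarn.executor.memoryOverhead", "1024"))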
35 votes • 3 answers

How to create correct data frame for classification in Spark ML

I am trying to run random forest classification using the Spark ML API, but I am having issues with creating the right data frame input for the pipeline. Here is a sample…
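spark.ml classifiers expect a double label column plus a vector-typed features column; a minimal sketch of that shape built by hand, assuming a SparkSession named spark:

    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import RandomForestClassifier

    train = spark.createDataFrame(
        [(0.0, Vectors.dense([1.0, 2.0])),   # toy rows, for shape only
         (1.0, Vectors.dense([3.0, 4.0]))],
        ["label", "features"])
    model = RandomForestClassifier(labelCol="label", featuresCol="features").fit(train)

For real data, VectorAssembler (see the pipeline sketches above) is the usual way to produce the features column.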
29 votes • 11 answers

How to extract best parameters from a CrossValidatorModel

I want to find the parameters of ParamGridBuilder that make the best model in CrossValidator in Spark 1.4.x. In the Pipeline example in the Spark documentation, they add different parameters (numFeatures, regParam) using ParamGridBuilder in the Pipeline.…
Mohammad • 1,006
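On recent PySpark the fitted CrossValidatorModel exposes both the grid and the mean metrics, so the winning combination can be read off directly; a sketch, assuming a fitted cvModel:

    # Pair each grid point with its mean cross-validation metric
    for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
        print(metric, {p.name: v for p, v in params.items()})

    # Or inspect the winning stage of the best pipeline (stage index is an assumption)
    print(cvModel.bestModel.stages[-1].extractParamMap())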