Questions tagged [apache-spark-mllib]

MLlib is a low-level, RDD-based machine learning library for Apache Spark

2241 questions
57 votes • 2 answers

What is the difference between Apache Mahout and Apache Spark's MLlib?

Consider a MySQL products database with 10 million products for an e-commerce website. I'm trying to set up a classification module to categorize products. I'm using Apache Sqoop to import data from MySQL to Hadoop. I wanted to use Mahout over…
eliasah • 39,588
53 votes • 2 answers

What's the difference between Spark ML and MLLIB packages

I noticed there are two LinearRegressionModel classes in Spark, one in the ML package (spark.ml) and another in the MLlib package (spark.mllib). These two are implemented quite differently, e.g. the one from MLlib implements Serializable, while the…
vyakhir • 1,714
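For reference, a minimal sketch of the two parallel APIs, assuming Spark 2.x and an active SparkSession (the column names are illustrative):

    # DataFrame-based API (spark.ml): works on DataFrame columns
    from pyspark.ml.regression import LinearRegression
    lr = LinearRegression(featuresCol="features", labelCol="label")

    # RDD-based API (spark.mllib): works on RDD[LabeledPoint]
    from pyspark.mllib.regression import LinearRegressionWithSGD
    # model = LinearRegressionWithSGD.train(labeled_point_rdd)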
51 votes • 2 answers

AttributeError: 'DataFrame' object has no attribute 'map'

I wanted to convert the Spark data frame to an RDD using the code below: from pyspark.mllib.clustering import KMeans spark_df = sqlContext.createDataFrame(pandas_df) rdd = spark_df.map(lambda data: Vectors.dense([float(c) for c in data])) model =…
Edamame • 23,718
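In Spark 2.x a DataFrame no longer exposes map directly; the usual fix is to drop to the underlying RDD first. A minimal sketch, assuming a SparkSession named spark and a numeric pandas DataFrame pandas_df:

    from pyspark.mllib.clustering import KMeans
    from pyspark.mllib.linalg import Vectors

    spark_df = spark.createDataFrame(pandas_df)
    # DataFrame has no .map in Spark 2.x; go through .rdd instead
    rdd = spark_df.rdd.map(lambda row: Vectors.dense([float(c) for c in row]))
    model = KMeans.train(rdd, k=2, maxIterations=10)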
50 votes • 5 answers

How to assign unique contiguous numbers to elements in a Spark RDD

I have a dataset of (user, product, review), and want to feed it into mllib's ALS algorithm. The algorithm needs users and products to be numbers, while mine are String usernames and String SKUs. Right now, I get the distinct users and SKUs, then…
Dilum Ranatunga • 13,254
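One common answer is zipWithIndex on the distinct keys, which assigns each string a unique contiguous Long starting at 0. A sketch, assuming ratings is an RDD of (username, sku, rating) triples:

    # Build string -> contiguous id lookups (collected to the driver here;
    # for very large key sets, broadcast the maps or use joins instead)
    user_ids = ratings.map(lambda t: t[0]).distinct().zipWithIndex().collectAsMap()
    sku_ids = ratings.map(lambda t: t[1]).distinct().zipWithIndex().collectAsMap()
    numeric = ratings.map(lambda t: (user_ids[t[0]], sku_ids[t[1]], float(t[2])))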
49 votes • 5 answers

How to handle categorical features with spark-ml?

How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the…
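In spark.ml the standard answer is StringIndexer, optionally followed by OneHotEncoder, inside a Pipeline. A minimal sketch, assuming a DataFrame df with a string column named category:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
    encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
    assembler = VectorAssembler(inputCols=["categoryVec"], outputCol="features")
    features_df = Pipeline(stages=[indexer, encoder, assembler]).fit(df).transform(df)

Tree-based classifiers such as RandomForestClassifier can usually consume the indexed column directly, without one-hot encoding.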
47 votes • 2 answers

Optimal way to create an ML pipeline in Apache Spark for a dataset with a high number of columns

I am working with Spark 2.1.1 on a dataset with ~2000 features and trying to create a basic ML Pipeline, consisting of some Transformers and a Classifier. Let's assume for the sake of simplicity that the Pipeline I am working with consists of a…
aMKa • 605
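For wide datasets the usual shape is a single VectorAssembler over all feature columns rather than one stage per column. A sketch, assuming a DataFrame df where every column except label is a numeric feature:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    feature_cols = [c for c in df.columns if c != "label"]
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    clf = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, clf]).fit(df)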
46 votes • 4 answers

How to serve a Spark MLlib model?

I'm evaluating tools for production ML-based applications and one of our options is Spark MLlib, but I have some questions about how to serve a model once it's trained. For example, in Azure ML, once trained, the model is exposed as a web service…
Luis Leal • 3,388
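Spark ships no serving layer of its own; one common baseline is persisting the fitted PipelineModel and reloading it behind a service you write yourself. A sketch of the persistence half (the path is illustrative):

    from pyspark.ml import PipelineModel

    model.save("hdfs:///models/classifier_v1")
    loaded = PipelineModel.load("hdfs:///models/classifier_v1")
    predictions = loaded.transform(new_data_df)  # scoring still requires a SparkSession

For low-latency single-record scoring, people typically export the model (e.g. via PMML or by re-implementing the scoring function) rather than keeping Spark in the request path.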
45 votes • 1 answer

Calling Java/Scala function from a task

Background: My original question here was "Why does using DecisionTreeModel.predict inside a map function raise an exception?" and it is related to "How to generate tuples of (original label, predicted label) on Spark with MLlib?" When we use the Scala API a…
zero323 • 322,348
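The root cause is that JVM-backed objects reached through the Py4J gateway only exist on the driver, so they cannot be called inside a worker-side closure. For MLlib models the documented workaround is to predict over a whole RDD from the driver and zip the results back:

    # Fails: model.predict(lp.features) inside map() runs on workers,
    # where the Py4J gateway to the JVM model does not exist.
    # Works: predict on the entire features RDD from the driver.
    predictions = model.predict(data.map(lambda lp: lp.features))
    labels_and_preds = data.map(lambda lp: lp.label).zip(predictions)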
43 votes • 1 answer

Out-of-core processing of sparse CSR arrays

How can one apply some function in parallel on chunks of a sparse CSR array saved on disk using Python? Sequentially this could be done, e.g., by saving the CSR array with joblib.dump, opening it with joblib.load(.., mmap_mode="r"), and processing the…
rth • 10,680
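A sketch of the recipe described above, extended with joblib.Parallel over row chunks; it assumes joblib memory-maps the CSR's underlying data/indices/indptr arrays on load, and process stands in for any function of a CSR slice:

    import joblib

    joblib.dump(X_csr, "X.joblib")              # X_csr: scipy.sparse.csr_matrix
    X = joblib.load("X.joblib", mmap_mode="r")  # internal arrays are memmapped

    def process(chunk):
        return chunk.sum()  # placeholder computation

    step = 10000
    results = joblib.Parallel(n_jobs=4)(
        joblib.delayed(process)(X[i:i + step]) for i in range(0, X.shape[0], step))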
39 votes • 3 answers

Column name with dot in Spark

I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their names, as in the following dataset: "col0.1","col1.2","col2.3","col3.4" 1,2,3,4 10,12,15,3 1,12,10,5 This is what…
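Backticks are the usual escape for dots in Spark column names, since an unquoted dot is parsed as struct-field access. A minimal sketch:

    from pyspark.sql.functions import col
    from pyspark.mllib.linalg import Vectors

    cols = [col("`{}`".format(c)) for c in df.columns]  # quote every name
    rdd = df.select(cols).rdd.map(lambda row: Vectors.dense([float(x) for x in row]))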
38 votes • 8 answers

How to extract model hyper-parameters from spark.ml in PySpark?

I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected: from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import…
Paul • 3,321
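Once fitting is done, the tuned values live on the best model. A sketch, assuming cvModel is a fitted CrossValidatorModel wrapping a LogisticRegression:

    best = cvModel.bestModel
    print(best.extractParamMap())  # shows resolved params on recent PySpark versions
    # On older PySpark versions the Java object had to be queried directly, e.g.:
    # print(best._java_obj.getRegParam(), best._java_obj.getMaxIter())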
36 votes • 3 answers

Dealing with unbalanced datasets in Spark MLlib

I'm working on a particular binary classification problem with a highly unbalanced dataset, and I was wondering if anyone has tried to implement specific techniques for dealing with unbalanced datasets (such as SMOTE) in classification problems…
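SMOTE is not built into MLlib; a common substitute is per-class instance weighting through weightCol, which several spark.ml classifiers accept. A sketch, assuming a DataFrame df with features and a binary label column where 1.0 is the rare class:

    from pyspark.sql import functions as F
    from pyspark.ml.classification import LogisticRegression

    # Weight each class by the other class's frequency
    ratio = df.filter(F.col("label") == 1.0).count() / float(df.count())
    weighted = df.withColumn(
        "weight", F.when(F.col("label") == 1.0, 1.0 - ratio).otherwise(ratio))
    model = LogisticRegression(weightCol="weight").fit(weighted)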
35 votes • 1 answer

The value of "spark.yarn.executor.memoryOverhead" setting?

Should the value of spark.yarn.executor.memoryOverhead in a Spark job with YARN be allocated to the app, or just the max value?
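The overhead is per executor and sits on top of spark.executor.memory; YARN sizes each container to their sum. A sketch of setting it explicitly (the values are illustrative):

    from pyspark import SparkConf

    conf = (SparkConf()
            .set("spark.executor.memory", "4g")
            # default is max(384 MB, 10% of executor memory) in recent Spark;
            # the key was renamed to spark.executor.memoryOverhead in Spark 2.3
            .set("spark.yarn.executor.memoryOverhead", "1024"))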
35 votes • 3 answers

How to create correct data frame for classification in Spark ML

I am trying to run random forest classification using the Spark ML API, but I am having issues with creating the right data frame input for the pipeline. Here is a sample…
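spark.ml classifiers expect a double label column plus a vector-typed features column; a minimal sketch of that shape built by hand, assuming a SparkSession named spark:

    from pyspark.ml.linalg import Vectors
    from pyspark.ml.classification import RandomForestClassifier

    train = spark.createDataFrame(
        [(0.0, Vectors.dense([1.0, 2.0])),   # toy rows, for shape only
         (1.0, Vectors.dense([3.0, 4.0]))],
        ["label", "features"])
    model = RandomForestClassifier(labelCol="label", featuresCol="features").fit(train)

For real data, VectorAssembler (see the pipeline sketches above) is the usual way to produce the features column.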
29 votes • 11 answers

How to extract best parameters from a CrossValidatorModel

I want to find the parameters of ParamGridBuilder that make the best model in CrossValidator in Spark 1.4.x. In the Pipeline example in the Spark documentation, they add different parameters (numFeatures, regParam) using ParamGridBuilder in the Pipeline.…
Mohammad • 1,006
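On recent PySpark the fitted CrossValidatorModel exposes both the grid and the mean metrics, so the winning combination can be read off directly; a sketch, assuming a fitted cvModel:

    # Pair each grid point with its mean cross-validation metric
    for params, metric in zip(cvModel.getEstimatorParamMaps(), cvModel.avgMetrics):
        print(metric, {p.name: v for p, v in params.items()})

    # Or inspect the winning stage of the best pipeline (stage index is an assumption)
    print(cvModel.bestModel.stages[-1].extractParamMap())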