Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
77 votes · 3 answers

How do I convert an array (i.e. list) column to Vector

Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession): from pyspark.sql import Row source_data = [ Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]), Row(city="New York",…
Arthur Tacca
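One common approach, sketched here assuming the array column holds plain Python floats: wrap it in a DenseVector through a UDF, since spark.ml estimators expect a VectorUDT column (names follow the question's snippet):

    from pyspark.sql import SparkSession, Row
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([
        Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]),
        Row(city="New York", temperatures=[-7.0, -7.0, -5.0]),
    ])

    # Wrap the list column in a DenseVector; VectorUDT is the type
    # spark.ml pipelines expect for a features column.
    list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
    df.withColumn("features", list_to_vector("temperatures")).show()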
70 votes · 4 answers

How to split Vector into columns - using PySpark

Context: I have a DataFrame with 2 columns: word and vector, where the column type of "vector" is VectorUDT. An example: word | vector assert | [435,323,324,212...] And I want to get this: word | v1 | v2 | v3 | v4 | v5 | v6 ...... assert |…
sedioben
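One way to do this, sketched under the assumption that the vector length is known and fixed: convert the vector to a plain array with a UDF, then project each element out as its own column (on Spark 3.0+, pyspark.ml.functions.vector_to_array removes the need for the UDF):

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    # VectorUDT values have no SQL element access pre-3.0, so go via an array
    to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

    n = 4  # assumed, known vector length
    df.withColumn("arr", to_array(col("vector"))) \
      .select("word", *[col("arr")[i].alias(f"v{i + 1}") for i in range(n)])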
53 votes · 2 answers

What's the difference between Spark ML and MLLIB packages

I noticed there are two LinearRegressionModel classes in SparkML, one in ML package (spark.ml) and another one in MLLib (spark.mllib) package. These two are implemented quite differently - e.g. the one from MLLib implements Serializable, while the…
vyakhir
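In short: spark.mllib is the original RDD-based API and has been in maintenance mode since Spark 2.0, while spark.ml is the DataFrame-based API where new development happens; the duplicated class names are parallel implementations, not wrappers. The two import paths make the split visible:

    # DataFrame-based API: actively developed, integrates with Pipelines
    from pyspark.ml.regression import LinearRegressionModel as MLModel

    # RDD-based API: maintenance mode since Spark 2.0
    from pyspark.mllib.regression import LinearRegressionModel as MLlibModel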
49 votes · 5 answers

How to handle categorical features with spark-ml?

How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the…
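The usual spark.ml recipe is StringIndexer (string → numeric index), optionally followed by OneHotEncoder, with VectorAssembler producing the featuresCol. A sketch assuming Spark 3.x and hypothetical column names:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    indexer = StringIndexer(inputCol="color", outputCol="color_idx")
    # Spark 3.x signature; on 2.3/2.4 use OneHotEncoderEstimator instead
    encoder = OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])
    assembler = VectorAssembler(inputCols=["color_vec", "price"],
                                outputCol="features")

    model = Pipeline(stages=[indexer, encoder, assembler]).fit(df)
    features_df = model.transform(df)

Tree-based learners such as RandomForestClassifier can usually consume the indexed column directly; one-hot encoding matters mostly for linear models.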
39 votes · 3 answers

Column name with dot spark

I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their name, as in the following dataset: "col0.1","col1.2","col2.3","col3.4" 1,2,3,4 10,12,15,3 1,12,10,5 This is what…
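Spark interprets a bare dot as struct-field access, so names like col0.1 must be escaped with backticks; a minimal illustration:

    from pyspark.sql.functions import col

    # Backticks make the dot part of the literal column name
    df.select(col("`col0.1`"), col("`col1.2`"))

    # Renaming the columns up front avoids the problem entirely
    clean = df.toDF(*[c.replace(".", "_") for c in df.columns])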
38 votes · 8 answers

How to extract model hyper-parameters from spark.ml in PySpark?

I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected: from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import…
Paul
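One version-agnostic way to inspect the winning model, assuming cv is the configured CrossValidator from the docs example: bestModel is a regular spark.ml model, and extractParamMap() lists every resolved parameter:

    cv_model = cv.fit(training)   # cv: a configured CrossValidator (assumed)
    best = cv_model.bestModel

    # Every spark.ml model carries its params; print name/value pairs
    for param, value in best.extractParamMap().items():
        print(param.name, "=", value)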
35 votes · 1 answer

Access element of a vector in a Spark DataFrame (Logistic Regression probability vector)

I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_predictions) (see [1]). The probability column (see [2]) is a vector type (see…
user2205916
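VectorUDT values do not support SQL indexing directly, so the common workaround is a small UDF that pulls out one component; a sketch assuming binary classification and the column names from the question:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # probability[1] = P(label == 1) for a binary LogisticRegression model
    second_elem = udf(lambda v: float(v[1]), DoubleType())
    cv_predictions = cv_predictions.withColumn("p1",
                                               second_elem("probability"))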
35 votes · 5 answers

pyspark : NameError: name 'spark' is not defined

I am copying the pyspark.ml example from the official document website: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0,…
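The doc examples assume the spark variable already exists, which is true inside the pyspark shell but not in a standalone script; creating the session explicitly fixes the NameError:

    from pyspark.sql import SparkSession

    # Predefined in the pyspark shell; must be created in a plain script
    spark = SparkSession.builder.appName("example").getOrCreate()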
35 votes · 1 answer

Create a custom Transformer in PySpark ML

I am new to Spark SQL DataFrames and ML on them (PySpark). How can I create a custom tokenizer, which for example removes stop words and uses some libraries from nltk? Can I extend the default one?
Niko
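A minimal custom-transformer sketch. The real stop-word/nltk logic is elided; the point is the boilerplate (subclass Transformer, mix in the shared input/output params, override _transform):

    from pyspark import keyword_only
    from pyspark.ml import Transformer
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    class LowerCaser(Transformer, HasInputCol, HasOutputCol):
        """Toy transformer: lower-cases a string column (stand-in for
        nltk-based tokenizing / stop-word removal)."""

        @keyword_only
        def __init__(self, inputCol=None, outputCol=None):
            super().__init__()
            self._set(**self._input_kwargs)

        def _transform(self, df):
            f = udf(lambda s: s.lower(), StringType())
            return df.withColumn(self.getOutputCol(),
                                 f(df[self.getInputCol()]))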
34 votes · 11 answers

Dropping a nested column from Spark DataFrame

I have a DataFrame with the schema root |-- label: string (nullable = true) |-- features: struct (nullable = true) | |-- feat1: string (nullable = true) | |-- feat2: string (nullable = true) | |-- feat3: string (nullable =…
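DataFrame.drop only removes top-level columns, so the classic workaround is rebuilding the struct from the fields you keep (Spark 3.1+ also offers Column.dropFields); a sketch using the schema above:

    from pyspark.sql.functions import col, struct

    # Recreate `features` with every field except feat3
    kept = ["feat1", "feat2"]
    df2 = df.withColumn(
        "features",
        struct(*[col(f"features.{f}").alias(f) for f in kept]))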
28 votes · 3 answers

How to map features from the output of a VectorAssembler back to the column names in Spark ML?

I'm trying to run a linear regression in PySpark and I want to create a table containing summary statistics such as coefficients, P-values and t-values for each column in my dataset. However, in order to train a linear regression model I had to…
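VectorAssembler records the source column of every vector slot in the output column's ML metadata; a sketch assuming the assembled column is called features:

    # "ml_attr" metadata maps each vector index back to an input column
    attrs = df.schema["features"].metadata["ml_attr"]["attrs"]
    name_by_index = {
        a["idx"]: a["name"]
        for group in attrs.values()   # "numeric", "binary", "nominal", ...
        for a in group
    }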
27 votes · 6 answers

Serialize a custom transformer using python to be used within a Pyspark ML pipeline

I found the same discussion in comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA corresponding to that: https://issues.apache.org/jira/browse/SPARK-17025. Given that there…
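Newer Spark releases address this: since 2.3, mixing DefaultParamsReadable and DefaultParamsWritable into a pure-Python transformer gives it save()/load(), provided all of its state lives in Params rather than plain attributes. A sketch:

    from pyspark.ml import Transformer
    from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

    class MyTokenizer(Transformer, DefaultParamsReadable,
                      DefaultParamsWritable):
        def _transform(self, df):
            return df  # real logic elided

    MyTokenizer().save("/tmp/my_tokenizer")   # persists params + metadata
    restored = MyTokenizer.load("/tmp/my_tokenizer")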
27 votes · 5 answers

How to access element of a VectorUDT column in a Spark DataFrame?

I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say first element? I've tried doing the following from pyspark.sql.functions import udf first_elem_udf = udf(lambda row:…
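On Spark 3.0+ there is a built-in that avoids the UDF attempted in the question: pyspark.ml.functions.vector_to_array converts the VectorUDT column to a plain array column, which does support indexing:

    from pyspark.ml.functions import vector_to_array

    # Spark 3.0+: cast the vector to an array, then index normally
    df = df.withColumn("first_elem", vector_to_array("features")[0])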
27 votes · 3 answers

How to define a custom aggregation function to sum a column of Vectors?

I have a DataFrame of two columns, ID of type Int and Vec of type Vector (org.apache.spark.mllib.linalg.Vector). The DataFrame looks as follows: ID,Vec 1,[0,0,5] 1,[4,0,1] 1,[1,2,1] 2,[7,5,0] 2,[3,3,4] 3,[0,8,1] 3,[0,0,1] 3,[7,7,7] .... I would…
Rami
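A version-agnostic sketch that sidesteps writing a custom UDAF: drop to the RDD, turn each vector into a numpy array, and reduce per key (on Spark 2.4+, pyspark.ml.stat.Summarizer offers built-in vector aggregations as an alternative):

    from pyspark.ml.linalg import Vectors

    # toArray() works on both ml and mllib vectors and yields numpy arrays,
    # which add element-wise; the result is rebuilt as an ml DenseVector.
    sums = (df.rdd
              .map(lambda row: (row["ID"], row["Vec"].toArray()))
              .reduceByKey(lambda a, b: a + b)
              .map(lambda kv: (kv[0], Vectors.dense(kv[1]))))
    result = sums.toDF(["ID", "VecSum"])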
27 votes · 1 answer

Encode and assemble multiple features in PySpark

I have a Python class that I'm using to load and process some data in Spark. Among various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark dataframe. My problem is that I'm not sure how to…
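When many columns need the same treatment, the stages can be generated in a loop and composed into a single Pipeline; a sketch assuming Spark 3.x and hypothetical column names:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    cat_cols = ["gender", "city"]   # hypothetical categorical columns
    num_cols = ["age"]              # hypothetical numeric column

    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                for c in cat_cols]
    encoder = OneHotEncoder(inputCols=[c + "_idx" for c in cat_cols],
                            outputCols=[c + "_vec" for c in cat_cols])
    assembler = VectorAssembler(
        inputCols=[c + "_vec" for c in cat_cols] + num_cols,
        outputCol="features")

    pipeline = Pipeline(stages=indexers + [encoder, assembler])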