Questions tagged [apache-spark-ml]

Spark ML is a high-level API for building machine learning pipelines in Apache Spark.

925 questions
77 votes · 3 answers

How do I convert an array (i.e. list) column to Vector

Short version of the question! Consider the following snippet (assuming spark is already set to some SparkSession): from pyspark.sql import Row source_data = [ Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]), Row(city="New York",…
Arthur Tacca
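One common approach, sketched here assuming the array column holds plain Python floats: wrap it in a DenseVector through a UDF, since spark.ml estimators expect a VectorUDT column (names follow the question's snippet):

    from pyspark.sql import SparkSession, Row
    from pyspark.sql.functions import udf
    from pyspark.ml.linalg import Vectors, VectorUDT

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([
        Row(city="Chicago", temperatures=[-1.0, -2.0, -3.0]),
        Row(city="New York", temperatures=[-7.0, -7.0, -5.0]),
    ])

    # Wrap the list column in a DenseVector; VectorUDT is the type
    # spark.ml pipelines expect for a features column.
    list_to_vector = udf(lambda xs: Vectors.dense(xs), VectorUDT())
    df.withColumn("features", list_to_vector("temperatures")).show()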
70 votes · 4 answers

How to split Vector into columns - using PySpark

Context: I have a DataFrame with 2 columns: word and vector, where the column type of "vector" is VectorUDT. An example: word | vector assert | [435,323,324,212...] And I want to get this: word | v1 | v2 | v3 | v4 | v5 | v6 ...... assert |…
sedioben
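One way to do this, sketched under the assumption that the vector length is known and fixed: convert the vector to a plain array with a UDF, then project each element out as its own column (on Spark 3.0+, pyspark.ml.functions.vector_to_array removes the need for the UDF):

    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import ArrayType, DoubleType

    # VectorUDT values have no SQL element access pre-3.0, so go via an array
    to_array = udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))

    n = 4  # assumed, known vector length
    df.withColumn("arr", to_array(col("vector"))) \
      .select("word", *[col("arr")[i].alias(f"v{i + 1}") for i in range(n)])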
53 votes · 2 answers

What's the difference between Spark ML and MLLIB packages

I noticed there are two LinearRegressionModel classes in SparkML, one in ML package (spark.ml) and another one in MLLib (spark.mllib) package. These two are implemented quite differently - e.g. the one from MLLib implements Serializable, while the…
vyakhir
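In short: spark.mllib is the original RDD-based API and has been in maintenance mode since Spark 2.0, while spark.ml is the DataFrame-based API where new development happens; the duplicated class names are parallel implementations, not wrappers. The two import paths make the split visible:

    # DataFrame-based API: actively developed, integrates with Pipelines
    from pyspark.ml.regression import LinearRegressionModel as MLModel

    # RDD-based API: maintenance mode since Spark 2.0
    from pyspark.mllib.regression import LinearRegressionModel as MLlibModel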
49 votes · 5 answers

How to handle categorical features with spark-ml?

How do I handle categorical data with spark-ml and not spark-mllib? Though the documentation is not very clear, it seems that classifiers, e.g. RandomForestClassifier and LogisticRegression, have a featuresCol argument, which specifies the name of the…
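The usual spark.ml recipe is StringIndexer (string → numeric index), optionally followed by OneHotEncoder, with VectorAssembler producing the featuresCol. A sketch assuming Spark 3.x and hypothetical column names:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    indexer = StringIndexer(inputCol="color", outputCol="color_idx")
    # Spark 3.x signature; on 2.3/2.4 use OneHotEncoderEstimator instead
    encoder = OneHotEncoder(inputCols=["color_idx"], outputCols=["color_vec"])
    assembler = VectorAssembler(inputCols=["color_vec", "price"],
                                outputCol="features")

    model = Pipeline(stages=[indexer, encoder, assembler]).fit(df)
    features_df = model.transform(df)

Tree-based learners such as RandomForestClassifier can usually consume the indexed column directly; one-hot encoding matters mostly for linear models.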
39 votes · 3 answers

Column name with dot spark

I am trying to take columns from a DataFrame and convert them to an RDD[Vector]. The problem is that I have columns with a "dot" in their name, as in the following dataset: "col0.1","col1.2","col2.3","col3.4" 1,2,3,4 10,12,15,3 1,12,10,5 This is what…
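Spark interprets a bare dot as struct-field access, so names like col0.1 must be escaped with backticks; a minimal illustration:

    from pyspark.sql.functions import col

    # Backticks make the dot part of the literal column name
    df.select(col("`col0.1`"), col("`col1.2`"))

    # Renaming the columns up front avoids the problem entirely
    clean = df.toDF(*[c.replace(".", "_") for c in df.columns])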
38 votes · 8 answers

How to extract model hyper-parameters from spark.ml in PySpark?

I'm tinkering with some cross-validation code from the PySpark documentation, and trying to get PySpark to tell me what model was selected: from pyspark.ml.classification import LogisticRegression from pyspark.ml.evaluation import…
Paul
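One version-agnostic way to inspect the winning model, assuming cv is the configured CrossValidator from the docs example: bestModel is a regular spark.ml model, and extractParamMap() lists every resolved parameter:

    cv_model = cv.fit(training)   # cv: a configured CrossValidator (assumed)
    best = cv_model.bestModel

    # Every spark.ml model carries its params; print name/value pairs
    for param, value in best.extractParamMap().items():
        print(param.name, "=", value)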
35 votes · 1 answer

Access element of a vector in a Spark DataFrame (Logistic Regression probability vector)

I trained a LogisticRegression model in PySpark (ML package) and the result of the prediction is a PySpark DataFrame (cv_predictions) (see [1]). The probability column (see [2]) is a vector type (see…
user2205916
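VectorUDT values do not support SQL indexing directly, so the common workaround is a small UDF that pulls out one component; a sketch assuming binary classification and the column names from the question:

    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType

    # probability[1] = P(label == 1) for a binary LogisticRegression model
    second_elem = udf(lambda v: float(v[1]), DoubleType())
    cv_predictions = cv_predictions.withColumn("p1",
                                               second_elem("probability"))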
35 votes · 5 answers

pyspark : NameError: name 'spark' is not defined

I am copying the pyspark.ml example from the official document website: http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.Transformer data = [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),(Vectors.dense([9.0,…
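The doc examples assume the spark variable already exists, which is true inside the pyspark shell but not in a standalone script; creating the session explicitly fixes the NameError:

    from pyspark.sql import SparkSession

    # Predefined in the pyspark shell; must be created in a plain script
    spark = SparkSession.builder.appName("example").getOrCreate()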
35 votes · 1 answer

Create a custom Transformer in PySpark ML

I am new to Spark SQL DataFrames and ML on them (PySpark). How can I create a custom tokenizer, which for example removes stop words and uses some libraries from nltk? Can I extend the default one?
Niko
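A minimal custom-transformer sketch. The real stop-word/nltk logic is elided; the point is the boilerplate (subclass Transformer, mix in the shared input/output params, override _transform):

    from pyspark import keyword_only
    from pyspark.ml import Transformer
    from pyspark.ml.param.shared import HasInputCol, HasOutputCol
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    class LowerCaser(Transformer, HasInputCol, HasOutputCol):
        """Toy transformer: lower-cases a string column (stand-in for
        nltk-based tokenizing / stop-word removal)."""

        @keyword_only
        def __init__(self, inputCol=None, outputCol=None):
            super().__init__()
            self._set(**self._input_kwargs)

        def _transform(self, df):
            f = udf(lambda s: s.lower(), StringType())
            return df.withColumn(self.getOutputCol(),
                                 f(df[self.getInputCol()]))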
34 votes · 11 answers

Dropping a nested column from Spark DataFrame

I have a DataFrame with the schema root |-- label: string (nullable = true) |-- features: struct (nullable = true) | |-- feat1: string (nullable = true) | |-- feat2: string (nullable = true) | |-- feat3: string (nullable =…
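DataFrame.drop only removes top-level columns, so the classic workaround is rebuilding the struct from the fields you keep (Spark 3.1+ also offers Column.dropFields); a sketch using the schema above:

    from pyspark.sql.functions import col, struct

    # Recreate `features` with every field except feat3
    kept = ["feat1", "feat2"]
    df2 = df.withColumn(
        "features",
        struct(*[col(f"features.{f}").alias(f) for f in kept]))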
28 votes · 3 answers

How to map features from the output of a VectorAssembler back to the column names in Spark ML?

I'm trying to run a linear regression in PySpark and I want to create a table containing summary statistics such as coefficients, P-values and t-values for each column in my dataset. However, in order to train a linear regression model I had to…
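VectorAssembler records the source column of every vector slot in the output column's ML metadata; a sketch assuming the assembled column is called features:

    # "ml_attr" metadata maps each vector index back to an input column
    attrs = df.schema["features"].metadata["ml_attr"]["attrs"]
    name_by_index = {
        a["idx"]: a["name"]
        for group in attrs.values()   # "numeric", "binary", "nominal", ...
        for a in group
    }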
27 votes · 6 answers

Serialize a custom transformer using python to be used within a Pyspark ML pipeline

I found the same discussion in comments section of Create a custom Transformer in PySpark ML, but there is no clear answer. There is also an unresolved JIRA corresponding to that: https://issues.apache.org/jira/browse/SPARK-17025. Given that there…
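Newer Spark releases address this: since 2.3, mixing DefaultParamsReadable and DefaultParamsWritable into a pure-Python transformer gives it save()/load(), provided all of its state lives in Params rather than plain attributes. A sketch:

    from pyspark.ml import Transformer
    from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable

    class MyTokenizer(Transformer, DefaultParamsReadable,
                      DefaultParamsWritable):
        def _transform(self, df):
            return df  # real logic elided

    MyTokenizer().save("/tmp/my_tokenizer")   # persists params + metadata
    restored = MyTokenizer.load("/tmp/my_tokenizer")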
27 votes · 5 answers

How to access element of a VectorUDT column in a Spark DataFrame?

I have a dataframe df with a VectorUDT column named features. How do I get an element of the column, say first element? I've tried doing the following from pyspark.sql.functions import udf first_elem_udf = udf(lambda row:…
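On Spark 3.0+ there is a built-in that avoids the UDF attempted in the question: pyspark.ml.functions.vector_to_array converts the VectorUDT column to a plain array column, which does support indexing:

    from pyspark.ml.functions import vector_to_array

    # Spark 3.0+: cast the vector to an array, then index normally
    df = df.withColumn("first_elem", vector_to_array("features")[0])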
27 votes · 3 answers

How to define a custom aggregation function to sum a column of Vectors?

I have a DataFrame of two columns, ID of type Int and Vec of type Vector (org.apache.spark.mllib.linalg.Vector). The DataFrame looks as follows: ID,Vec 1,[0,0,5] 1,[4,0,1] 1,[1,2,1] 2,[7,5,0] 2,[3,3,4] 3,[0,8,1] 3,[0,0,1] 3,[7,7,7] .... I would…
Rami
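A version-agnostic sketch that sidesteps writing a custom UDAF: drop to the RDD, turn each vector into a numpy array, and reduce per key (on Spark 2.4+, pyspark.ml.stat.Summarizer offers built-in vector aggregations as an alternative):

    from pyspark.ml.linalg import Vectors

    # toArray() works on both ml and mllib vectors and yields numpy arrays,
    # which add element-wise; the result is rebuilt as an ml DenseVector.
    sums = (df.rdd
              .map(lambda row: (row["ID"], row["Vec"].toArray()))
              .reduceByKey(lambda a, b: a + b)
              .map(lambda kv: (kv[0], Vectors.dense(kv[1]))))
    result = sums.toDF(["ID", "VecSum"])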
27 votes · 1 answer

Encode and assemble multiple features in PySpark

I have a Python class that I'm using to load and process some data in Spark. Among various things I need to do, I'm generating a list of dummy variables derived from various columns in a Spark dataframe. My problem is that I'm not sure how to…
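When many columns need the same treatment, the stages can be generated in a loop and composed into a single Pipeline; a sketch assuming Spark 3.x and hypothetical column names:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

    cat_cols = ["gender", "city"]   # hypothetical categorical columns
    num_cols = ["age"]              # hypothetical numeric column

    indexers = [StringIndexer(inputCol=c, outputCol=c + "_idx")
                for c in cat_cols]
    encoder = OneHotEncoder(inputCols=[c + "_idx" for c in cat_cols],
                            outputCols=[c + "_vec" for c in cat_cols])
    assembler = VectorAssembler(
        inputCols=[c + "_vec" for c in cat_cols] + num_cols,
        outputCol="features")

    pipeline = Pipeline(stages=indexers + [encoder, assembler])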