I am trying to run logistic regression on a simple data set to understand the syntax of PySpark.
My data has 11 columns, where the first 10 columns are features and the last (11th) column is the label.
I want to pass these 10 columns as features and the 11th column as the label.
But I only know how to pass a single column as a feature, using featuresCol="col_header_name".
I have read the data from a CSV file using pandas and converted it into a Spark DataFrame.
Here is the code:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.classification import LogisticRegression
import pandas as pd

# read the CSV with pandas, then convert it to a Spark DataFrame
data = pd.read_csv('abc.csv')
sc = SparkContext("local", "App Name")
sql = SQLContext(sc)
spDF = sql.createDataFrame(data)

# featuresCol only takes a single column name here
tri = LogisticRegression(maxIter=10, regParam=0.01,
                         featuresCol="single_column", labelCol="label")
lr_model = tri.fit(spDF)
If I instead pass a list, i.e. featuresCol=[list_of_header_names], I get errors.
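Roughly what I tried (f1 ... f10 are placeholders for my actual column headers):

tri = LogisticRegression(maxIter=10, regParam=0.01,
                         featuresCol=["f1", "f2", "f3", "f4", "f5",
                                      "f6", "f7", "f8", "f9", "f10"],
                         labelCol="label")
# this raises a param/type error instead of training on all 10 columns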
I have used scikit-learn, which has a really simple syntax, something like:

reg = LogisticRegression()
reg = reg.fit(Dataframe_of_features, Label_array)
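For comparison, here is a minimal runnable scikit-learn version of what I mean, assuming the same abc.csv layout (10 feature columns followed by the label column):

from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.read_csv('abc.csv')
X = data.iloc[:, :10]  # first 10 columns as features
y = data.iloc[:, 10]   # 11th column as the label

reg = LogisticRegression()
reg = reg.fit(X, y)    # takes a feature DataFrame and a label array directly

What is the correct way to pass all 10 columns as features to PySpark's LogisticRegression?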