I am trying to run logistic regression on a simple data set to understand the syntax of PySpark.
My data has 11 columns, where the first 10 columns are features and the last (11th) column is the label.
I want to pass these 10 columns as features and the 11th column as the label.
But I only know how to pass a single column as a feature, using featuresCol="col_header_name".
I have read the data from a CSV file using pandas and converted it into a Spark DataFrame.
Here is the code:
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.classification import LogisticRegression
import pandas as pd

# read the CSV with pandas, then convert it to a Spark DataFrame
data = pd.read_csv('abc.csv')
sc = SparkContext("local", "App Name")
sql = SQLContext(sc)
spDF = sql.createDataFrame(data)

# featuresCol only takes a single column name here
tri = LogisticRegression(maxIter=10, regParam=0.01,
                         featuresCol="single_column", labelCol="label")
lr_model = tri.fit(spDF)
If I instead pass a list, i.e. featuresCol=[list_of_header_names], I get errors.
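Roughly what I tried (f1 ... f10 are placeholders for my actual column headers):

tri = LogisticRegression(maxIter=10, regParam=0.01,
                         featuresCol=["f1", "f2", "f3", "f4", "f5",
                                      "f6", "f7", "f8", "f9", "f10"],
                         labelCol="label")
# this raises a param/type error instead of training on all 10 columns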
I have used scikit-learn, which has a really simple syntax, something like:

reg = LogisticRegression()
reg = reg.fit(Dataframe_of_features, Label_array)
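For comparison, here is a minimal runnable scikit-learn version of what I mean, assuming the same abc.csv layout (10 feature columns followed by the label column):

from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.read_csv('abc.csv')
X = data.iloc[:, :10]  # first 10 columns as features
y = data.iloc[:, 10]   # 11th column as the label

reg = LogisticRegression()
reg = reg.fit(X, y)    # takes a feature DataFrame and a label array directly

What is the correct way to pass all 10 columns as features to PySpark's LogisticRegression?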