The Spark Pipeline framework allows pipelines of transforms for machine learning (or other applications) to be built in a reproducible way. However, while building the dataframes, I also want to be able to do exploratory analysis.
In my case, I have ~100 columns, of which 80 are strings and need to be one-hot encoded:
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel
# cols_to_one_hot_encode_2 is a list of columns that need to be one hot encoded
# cols_to_keep_as_is are columns that are **not** one hot encoded
cols_to_one_hot_encode_3 = [i + "_hot" for i in cols_to_one_hot_encode_2]
encoder = OneHotEncoderEstimator(inputCols=cols_to_one_hot_encode_2,
                                 outputCols=cols_to_one_hot_encode_3, dropLast=False)
# assemble pipeline
vectorAssembler = VectorAssembler().setInputCols(cols_to_keep_as_is + cols_to_one_hot_encode_3).setOutputCol("features")
# indexers is the list of StringIndexer stages built earlier, one per string column
all_stages = indexers + [encoder, vectorAssembler]
transformationPipeline = Pipeline(stages=all_stages)
fittedPipeline = transformationPipeline.fit(df_3)
dataset = fittedPipeline.transform(df_3)
# now pass to logistic regression
selectedcols = ["response_variable", "features"]  # +df_3.columns
dataset_2 = dataset.select(selectedcols)
# Create initial LogisticRegression model
lr = LogisticRegression(labelCol="response_variable", featuresCol="features", maxIter=10, elasticNetParam=1)
# Train model with Training Data
lrModel = lr.fit(dataset_2)
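As I understand it, the fitted PipelineModel simply chains its fitted stages, so the intermediate dataframes do exist at each step. A minimal sketch (using the public stages attribute) of what I mean by replaying the pipeline one stage at a time:

# Sketch: replay the fitted pipeline stage by stage. Each element of
# fittedPipeline.stages is a fitted transformer (the StringIndexerModels,
# then the OneHotEncoderModel, then the VectorAssembler).
df_step = df_3
for stage in fittedPipeline.stages:
    df_step = stage.transform(df_step)
    print(stage.uid, "->", df_step.columns[-1])  # column this stage appended

Even then, each encoded column is still a vector, not the flat columns I want to explore.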
When I look at dataset_2 with display(dataset_2), it prints:
response_variable features
0 [0,6508,[1,4,53,155,166,186,205,242,2104,6225,6498],[8220,1,1,1,1,1,1,1,1,1,1]]
0 [0,6508,[1,3,53,155,165,185,207,243,2104,6225,6498],[8220,1,1,1,1,1,1,1,1,1,1]]
0 [0,6508,[1,2,53,158,170,185,206,241,2104,6225,6498],[8222,1,1,1,1,1,1,1,1,1,1]]
0 [0,6508,[1,3,53,156,168,185,205,240,2104,6225,6498],[8222,1,1,1,1,1,1,1,1,1,1]]
0 [0,6508,[1,2,53,155,166,185,205,240,2104,6225,6498],[8223,1,1,1,1,1,1,1,1,1,1]]
This is totally useless for doing feature exploration. Notice that the one-hot encoder has exploded my features from ~100 columns to 6508.
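For clarity on what is being printed: each features cell is Spark's SparseVector, displayed as (type tag, size, indices, values); the leading 0 marks a sparse vector and 6508 is its length. A tiny illustrative snippet (made-up values, not my data):

from pyspark.ml.linalg import SparseVector

# Illustrative only: a length-6508 vector with two non-zero slots.
v = SparseVector(6508, [1, 4], [8220.0, 1.0])
print(v.size)           # 6508
print(v.indices)        # [1 4]
print(v.toArray()[:6])  # [   0. 8220.    0.    0.    1.    0.]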
My question
How do I look at the dataframe that is created under the hood by the pipeline? It should be a dataframe with 6508 feature columns and the corresponding number of rows. For example, I want something like:
response_variable feature_1_hot_1 feature_1_hot_2 feature_1_hot_3 ... (6505 more columns)
0 1 1 0
etc.
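To make the target shape concrete: I believe something like the sketch below would produce it, using the ml_attr metadata that VectorAssembler attaches to the features column plus vector_to_array (Spark 3.0+ only, whereas OneHotEncoderEstimator is the Spark 2.x API), so treat this as an illustration of the desired output rather than a confirmed solution:

from pyspark.ml.functions import vector_to_array  # Spark 3.0+
import pyspark.sql.functions as F

# Slot names come from the "ml_attr" metadata on the assembled column,
# e.g. {"attrs": {"numeric": [{"idx": 0, "name": ...}], "binary": [...]}}.
attrs = dataset_2.schema["features"].metadata["ml_attr"]["attrs"]
slots = sorted((a["idx"], a["name"]) for group in attrs.values() for a in group)

# Turn the vector into an array column, then one column per slot (very wide!).
wide = (dataset_2
        .withColumn("arr", vector_to_array("features"))
        .select("response_variable",
                *[F.col("arr")[i].alias(name) for i, name in slots]))
wide.show(5)

With 6508 slots that select is enormous, which is why I am asking whether the pipeline already materializes such a dataframe under the hood.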
Not a duplicate
Not a duplicate of How to split Vector into columns - using PySpark. That question asks how to do literal string splitting based on a delimiter; the transform done by the pipeline is not a simple string split. See Using Spark ML Pipelines just for Transformations.