I have a DataFrame with a column of dense vectors, i.e. multiclass classification prediction probabilities. I want to convert that column to a numpy array, but I am facing shape mismatch issues. These are the things I tried.
One answer I found here did convert the values into a numpy array, but while the original DataFrame had 4653 observations, the shape of the numpy array was (4712, 21). I don't understand how the row count increased, and in another attempt with the same code the numpy array's row count decreased below the count of the original DataFrame. I don't understand why. I also tried
predictions.select("probability").toPandas().values.shape
but again the shape was mismatched. I used the count() method of the PySpark DataFrame to check the length of the DataFrame. I also tried a UDF with the toArray() method of the vector column, which resulted in a strange error like this:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 116.0 failed 4 times, most recent failure: Lost task 2.3 in stage 116.0 (TID 6254, 10.2.1.54, executor 0): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.core.multiarray._reconstruct)
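From what I have read, this PickleException usually appears when a Python UDF returns a numpy array (or numpy scalars) rather than plain Python types, which Spark's pickler cannot serialize. My guess is that calling .tolist() inside the UDF avoids it. Here is a minimal sketch of that conversion; the FakeDenseVector class is only a stand-in for pyspark.ml.linalg.DenseVector so the example runs without Spark:

```python
import numpy as np

class FakeDenseVector:
    """Stand-in for pyspark.ml.linalg.DenseVector, just for this sketch."""
    def __init__(self, values):
        self._values = np.asarray(values, dtype=float)

    def toArray(self):
        return self._values

# In a real UDF I would register something like:
#   udf(lambda v: v.toArray().tolist(), ArrayType(DoubleType()))
# The key part is .tolist(): it yields plain Python floats, which the
# pickler can handle, unlike a raw numpy array.
convert = lambda v: v.toArray().tolist()

out = convert(FakeDenseVector([0.2, 0.8]))
print(type(out).__name__, out)  # list [0.2, 0.8]
```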
Here is what I am doing:

rf = RandomForestClassifier(
    featuresCol="features",
    labelCol=TARGET_COL,
    predictionCol=TARGET_COL + "_predicted",
    # impurity="entropy",
    # maxDepth=5,
    # numTrees=1000,
    # minInfoGain=0.2,
    # subsamplingRate=0.8
)
evaluator = MulticlassClassificationEvaluator(
    predictionCol=TARGET_COL + "_predicted",
    labelCol=TARGET_COL,
    metricName="accuracy"
)
paramGrid = ParamGridBuilder(). \
    addGrid(rf.maxDepth, [3, 5, 7, 9, 11]). \
    addGrid(rf.numTrees, [20, 50, 100, 200, 500]). \
    addGrid(rf.minInfoGain, [0.0, 0.2, 0.5, 1.0]). \
    addGrid(rf.subsamplingRate, [0.5, 0.8, 1.0]). \
    addGrid(rf.impurity, ["entropy", "gini"]). \
    build()
# note: this smaller grid reassigns paramGrid, overriding the full grid above
paramGrid = ParamGridBuilder(). \
    addGrid(rf.maxDepth, [3]). \
    addGrid(rf.numTrees, [2]). \
    addGrid(rf.minInfoGain, [0.0]). \
    addGrid(rf.subsamplingRate, [0.5]). \
    addGrid(rf.impurity, ["entropy"]). \
    build()
tvs = TrainValidationSplit(estimator=rf,
                           estimatorParamMaps=paramGrid,
                           evaluator=evaluator,
                           trainRatio=0.8)
print("~~~~~~~~~~~ Model Training Started ~~~~~~~~~~~")
model = tvs.fit(train_df)
best_model = model.bestModel
print(best_model._java_obj.parent().getImpurity())
print(best_model._java_obj.parent().getMaxDepth())
print(best_model._java_obj.parent().getNumTrees())
print(best_model._java_obj.parent().getMinInfoGain())
print(best_model._java_obj.parent().getSubsamplingRate())
prob_array = []
predictions = model.transform(test_df)
print(predictions.count())
print(test_df.count())
pprint(predictions.select("probability").head(1)[0].probability)
pprint(predictions.select("probability").head(1)[0].probability.toArray())
pprint(type(predictions.select("probability").head(1)[0].probability.toArray()))
pprint(predictions.select("probability").head(1)[0].probability.toArray().shape)
print(predictions.select("probability").count())
print(predictions.select("probability").toPandas())
print(predictions.select("probability").toPandas().values.shape)
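What I am ultimately trying to get is a single (n_rows, n_classes) matrix. My current suspicion is that predictions is recomputed between actions (count() versus toPandas()), so calling predictions.cache() before collecting might keep the counts consistent; that part is just a guess. The stacking step itself would look like this (the collected rows are simulated with plain numpy arrays so the sketch runs without Spark):

```python
import numpy as np

# Hypothetical stand-in for predictions.select("probability").collect(),
# where each row's .probability is a DenseVector and .toArray() yields a
# 1-D numpy array of class probabilities.
collected = [np.array([0.1, 0.7, 0.2]),
             np.array([0.5, 0.3, 0.2])]

# Stack the per-row probability vectors into one (n_rows, n_classes) matrix.
prob_matrix = np.vstack(collected)
print(prob_matrix.shape)  # (2, 3) for this toy data
```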