I'm facing an issue with the OneHotEncoder of SparkML: it reads DataFrame metadata to determine the value range it should assign to the sparse vector object it's creating.
More specifically, I'm encoding an "hour" field using a training set containing all individual values between 0 and 23.
Now I'm scoring a single-row DataFrame using the "transform" method of the Pipeline.
Unfortunately, this leads to a differently encoded sparse vector object for the OneHotEncoder:
(24,[5],[1.0]) vs. (11,[10],[1.0])
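To illustrate what I think is happening, here is a plain-Python sketch. It is not Spark's actual implementation; the function name `one_hot` and its parameters are hypothetical, and it assumes the last category is not dropped. Without metadata, the vector size is inferred from the largest value present in the data being transformed:

```python
def one_hot(index, num_levels=None, observed=None):
    """Return (size, [index], [1.0]), mimicking a sparse vector's repr.

    Hypothetical helper: if num_levels (the metadata) is missing, the size
    is guessed from the largest value seen in `observed`.
    """
    if num_levels is None:
        # No metadata available: infer the range from the data itself.
        num_levels = int(max(observed)) + 1
    return (num_levels, [index], [1.0])


# Training set contains every hour from 0 to 23.
print(one_hot(5, observed=list(range(24))))   # (24, [5], [1.0])

# Scoring a single row with hour == 10 and no metadata.
print(one_hot(10, observed=[10]))             # (11, [10], [1.0])

# With the number of levels supplied explicitly, the size stays stable.
print(one_hot(10, num_levels=24))             # (24, [10], [1.0])
```

This matches the two vector sizes I'm seeing, which is why attaching the correct metadata to the scoring DataFrame looks like the right fix.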
I've documented this here, but it was identified as a duplicate. In this thread, a solution is posted that updates the DataFrame's metadata to reflect the real range of the "hour" field:
    from pyspark.sql.functions import col

    meta = {"ml_attr": {
        "vals": [str(x) for x in range(6)],  # Provide a set of levels
        "type": "nominal",
        "name": "class"}}

    loaded.transform(
        df.withColumn("class", col("class").alias("class", metadata=meta))
    )
Unfortunately, I get this error:
TypeError: alias() got an unexpected keyword argument 'metadata'
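For my own case, I assume the metadata would need to describe all 24 hours rather than the 6 levels from the quoted example. A minimal sketch of how I would build it is below; the helper name `nominal_metadata` is hypothetical, and I'm assuming the level labels are the plain strings "0" to "23":

```python
def nominal_metadata(name, num_levels):
    """Build an ml_attr metadata dict for a nominal column.

    Hypothetical helper; the level labels "0".."num_levels - 1" are an
    assumption about how the values should be named.
    """
    return {
        "ml_attr": {
            "vals": [str(x) for x in range(num_levels)],  # one label per level
            "type": "nominal",
            "name": name,
        }
    }


hour_meta = nominal_metadata("hour", 24)
print(len(hour_meta["ml_attr"]["vals"]))  # 24
```

As far as I can tell, the `metadata` keyword of `alias()` only exists in PySpark 2.2 and later, so passing this dict the same way as in the snippet above presumably depends on the Spark version in use.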