During training, my data set contains values for the hour field ranging from 0 to 23, and I fit and apply the pipeline like this:
model = pipeline.fit(df)
prediction = model.transform(df)
So the sparse vector created by the OneHotEncoder looks like this:
(24,[5],[1.0])
Now I want to score the model on a single row data set df2 with hour set to 10:
model = pipeline.fit(df)
prediction = model.transform(df2)
The obtained sparse vector object for hour now looks like this:
(11,[10],[1.0])
As a result, when I score the model on data whose hour values do not span the full training range, I get this error:
Caused by: java.lang.IndexOutOfBoundsException: 21 not in [0,11)
But note that I'm calling pipeline.fit using a data frame containing the whole range for hour:
df.agg({"hour": "min"}).show()
df.agg({"hour": "max"}).show()
+---------+
|min(hour)|
+---------+
|        0|
+---------+

+---------+
|max(hour)|
+---------+
|       23|
+---------+
So is there a way to give OneHotEncoder a hint on the range of the encoded vectors? Or is there any better way of doing this?
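To make the mismatch concrete, here is a minimal plain-Python sketch, not Spark code. It assumes the encoder derives the vector size from the largest value in the frame it is given (max + 1), which matches the two vectors shown earlier but is only a guess about the internals. The function `one_hot` and its `size` argument are hypothetical names used purely for illustration.

```python
def one_hot(index, size=None):
    """Return a (size, [index], [1.0]) tuple mimicking the sparse-vector printout.

    If size is None it is inferred as index + 1, which is the ASSUMED
    behaviour when the encoder only sees a single row (not confirmed).
    """
    if size is None:
        size = index + 1  # assumption: size is inferred from the data
    if not 0 <= index < size:
        raise IndexError("%d not in [0,%d)" % (index, size))
    return (size, [index], [1.0])


# Size derived from the single scoring row: the vector shrinks.
inferred = one_hot(10)

# Size pinned to the training range: the vector keeps its 24 slots.
fixed = one_hot(10, size=24)

print(inferred)
print(fixed)
```

The `size` argument is the kind of hint I am asking for: with it, a single row with hour 10 would still produce a 24-element vector that lines up with what the model saw during training.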
EDIT October 9th
I've been told that this thread contains a solution to my problem. Unfortunately, when I try the Python solution I get this error:
TypeError: alias() got an unexpected keyword argument 'metadata'
I'm on Spark 2.1 (on IBM DataScience Experience, so I can't upgrade to 2.2 myself and have to wait for the platform to do so).