OneHotEncoder created sparse vector too short (ApacheSpark, pyspark)

Question

My training data set contains values for the hour field between 0 and 23:

model = pipeline.fit(df)
prediction = model.transform(df)

So the sparse vector created by the OneHotEncoder looks like this:

(24,[5],[1.0])

Now I want to score the model on a single-row data set, df2, with hour set to 10:

model = pipeline.fit(df)
prediction = model.transform(df2)

The obtained sparse vector object for hour now looks like this:

(11,[10],[1.0])
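The two vector sizes above are consistent with an encoder that derives the vector length from the largest value in whatever data it is currently transforming, rather than from the data seen at fit time. The following is a minimal pure-Python sketch of that assumption, not Spark's actual implementation; the helper `one_hot_sparse` and its `size` parameter are hypothetical names used only for illustration.

```python
def one_hot_sparse(values, size=None):
    """Encode non-negative integer categories as (size, [index], [1.0]) tuples.

    Assumption being illustrated: when no size is given, it is inferred as
    max(values) + 1 from the data passed in, so it changes with the input.
    """
    if size is None:
        size = max(values) + 1
    return [(size, [v], [1.0]) for v in values]


training_hours = list(range(24))  # hours 0..23, as in the training data
scoring_hours = [10]              # the single-row scoring data

train_vectors = one_hot_sparse(training_hours)
score_vectors = one_hot_sparse(scoring_hours)
fixed_vectors = one_hot_sparse(scoring_hours, size=24)  # size given explicitly

print(train_vectors[5])  # (24, [5], [1.0])
print(score_vectors[0])  # (11, [10], [1.0])
print(fixed_vectors[0])  # (24, [10], [1.0])
```

With an explicit size of 24 the single-row input yields a vector of the same length as the training vectors, which is the kind of hint asked about in this question.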

As a result, scoring fails with the following error, as if the hour value were outside the range seen during training:

Caused by: java.lang.IndexOutOfBoundsException: 21 not in [0,11)

Here is the full error message.

But note that I'm calling pipeline.fit using a data frame containing the whole range for hour:

df.agg({"hour": "min"}).show()
df.agg({"hour": "max"}).show()



 +---------+ 
 |min(hour)| 
 +---------+ 
 |        0| 
 +---------+ 

 +---------+ 
 |max(hour)| 
 +---------+ 
 |       23| 
 +---------+

So is there a way to give OneHotEncoder a hint on the range of the encoded vectors? Or is there any better way of doing this?

EDIT October 9th

I've been informed that a solution to my problem exists in this thread. Unfortunately, I get this error when trying the Python solution:

TypeError: alias() got an unexpected keyword argument 'metadata'

I'm on Spark 2.1 (on IBM Data Science Experience, so I can't upgrade to 2.2 yet and have to wait).

@user6910411 the solution on the duplicate thread doesn't work for me. I've documented this in this question, can you please reopen? — Romeo Kienzler, Oct 09 '17 at 20:25
