I have a column with vectors like:
{"vectorType": "dense", "length": 7, "values": [0.6514390482520993, 0.195419935311166, 0.07402558401796841, 0.032660728394633604, 0.01775634445896352, 0.009769471437190208, 0.018928888127979045]}
and I would like to create a UDF to process these probabilities, like so:
from pyspark.sql.functions import udf

@udf("integer")
def process_probability(x):
    vector_values = x['values']
    if vector_values[0] > 0.65: return 0
    elif vector_values[1] > 0.198: return 1
    elif vector_values[2] > 0.08: return 2
    elif vector_values[3] > 0.035: return 3
    elif vector_values[4] > 0.045: return 4
    elif vector_values[5] > 0.00976: return 5
    elif vector_values[6] > 0.0188: return 6
but when I apply it, I get the following error:
>> processed_predictions = predictions.withColumn('processed_prediction', process_probability(predictions.probability))
File "<command-8587444>", line 3, in process_probability
File "/spark/python/pyspark/ml/linalg/__init__.py", line 394, in __getitem__
return self.array[item]
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
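The error can be reproduced outside of Spark: the traceback shows `DenseVector.__getitem__` delegating to `self.array[item]`, where `self.array` is a NumPy array, and a NumPy array rejects a string key like `'values'` with this same IndexError. A minimal sketch of that (using NumPy directly, as an illustration, rather than the Spark pipeline itself):

```python
import numpy as np

# Stand-in for the vector's underlying storage: DenseVector keeps its
# values in a NumPy array and forwards indexing to it (self.array[item]).
arr = np.array([0.6514, 0.1954, 0.0740, 0.0327, 0.0178, 0.0098, 0.0189])

try:
    arr['values']  # string key, as in x['values'] inside the UDF
except IndexError as e:
    # Same class of error as in the Spark traceback
    print(type(e).__name__)

# Integer indexing, by contrast, works fine:
print(arr[0] > 0.65)
```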