I have a PySpark DataFrame where multiple columns contain arrays of different lengths. I want to iterate through the relevant columns and clip the array in each row so that they all have the same length, in this example a length of 3.
This is an example dataframe:
id_1|id_2|id_3|timestamp             |thing1           |thing2       |thing3
A   |b   |c   |[time_0,time_1,time_2]|[1.2,1.1,2.2]    |[1.3,1.5,2.6]|[2.5,3.4,2.9]
A   |b   |d   |[time_0,time_1]       |[5.1,6.1,1.4,1.6]|[5.5,6.2,0.2]|[5.7,6.3]
A   |b   |e   |[time_0,time_1]       |[0.1,0.2,1.1]    |[0.5,0.3,0.3]|[0.9,0.6,0.9,0.4]
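For reference, this is roughly how the example DataFrame can be built (a sketch only; the timestamp entries are stood in as plain strings here):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

example = spark.createDataFrame(
    [
        ("A", "b", "c", ["time_0", "time_1", "time_2"], [1.2, 1.1, 2.2], [1.3, 1.5, 2.6], [2.5, 3.4, 2.9]),
        ("A", "b", "d", ["time_0", "time_1"], [5.1, 6.1, 1.4, 1.6], [5.5, 6.2, 0.2], [5.7, 6.3]),
        ("A", "b", "e", ["time_0", "time_1"], [0.1, 0.2, 1.1], [0.5, 0.3, 0.3], [0.9, 0.6, 0.9, 0.4]),
    ],
    ["id_1", "id_2", "id_3", "timestamp", "thing1", "thing2", "thing3"],
)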
So far I have:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

def clip_func(x, ts_len, backfill=1500):
    # pad the front with the backfill value, then keep only the last ts_len entries
    template = [backfill] * ts_len
    template[-len(x):] = x
    x = template
    return x[-1 * ts_len:]
clip = udf(clip_func, ArrayType(DoubleType()))

for c in [x for x in example.columns if 'thing' in x]:
    missing_fill = 3.3
    example = example.withColumn(c, clip(c, 3, missing_fill))
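Called on plain Python lists, clip_func does what I want (it pads at the front when the input is short and keeps the last ts_len values when it is long), for example with values taken from the rows above:

>>> clip_func([5.1, 6.1, 1.4, 1.6], 3, 3.3)
[6.1, 1.4, 1.6]
>>> clip_func([5.7, 6.3], 3, 3.3)
[3.3, 5.7, 6.3]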
But the Spark version is not working. If an array is too short, I want it filled out to the target length with the missing_fill value.
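I suspect part of the problem may be that I am passing plain Python values (3 and missing_fill) straight into the UDF call, when Spark expects Column arguments. This is a sketch of what I think the call should look like instead, wrapping them with lit(), though I am not sure this is the only issue:

from pyspark.sql.functions import col, lit

for c in [x for x in example.columns if 'thing' in x]:
    missing_fill = 3.3
    # non-column arguments have to be passed to the UDF as literal columns
    example = example.withColumn(c, clip(col(c), lit(3), lit(missing_fill)))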