I have the following column in a PySpark dataframe, of type array<int>:

+--------------------+
|     feature_indices|
+--------------------+
|                 [0]|
|[0, 1, 4, 10, 11,...|
|           [0, 1, 2]|
|                 [1]|
|                 [0]|
+--------------------+

I am trying to pad each array with zeros and then cap its length, so that every row's array has the same length. For example, for n = 5, I expect:

+--------------------+
|     feature_indices|
+--------------------+
|     [0, 0, 0, 0, 0]|
|   [0, 1, 4, 10, 11]|
|     [0, 1, 2, 0, 0]|
|     [1, 0, 0, 0, 0]|
|     [0, 0, 0, 0, 0]|
+--------------------+

Any suggestions? I looked at the PySpark rpad function, but it only operates on string-type columns.

dportman

2 Answers

You can write a UDF to do this:

from pyspark.sql.types import ArrayType, IntegerType
import pyspark.sql.functions as F

# Truncate to at most 5 elements, then right-pad with zeros to length 5
pad_fix_length = F.udf(
    lambda arr: arr[:5] + [0] * (5 - len(arr[:5])),
    ArrayType(IntegerType())
)

df.withColumn('feature_indices', pad_fix_length(df.feature_indices)).show()
+-----------------+
|  feature_indices|
+-----------------+
|  [0, 0, 0, 0, 0]|
|[0, 1, 4, 10, 11]|
|  [0, 1, 2, 0, 0]|
|  [1, 0, 0, 0, 0]|
|  [0, 0, 0, 0, 0]|
+-----------------+
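
If you are on Spark 2.4 or later, the same result can be obtained without a UDF, using the built-in array functions concat, array_repeat and slice. A minimal sketch, assuming the column is named feature_indices and n = 5 as in the question:

import pyspark.sql.functions as F

n = 5
padded = F.slice(
    F.concat(F.col('feature_indices'), F.array_repeat(F.lit(0), n)),
    1, n  # slice is 1-indexed: keep the first n elements
)
df.withColumn('feature_indices', padded).show()

Because this stays inside the JVM, it avoids the Python serialization overhead of a UDF and tends to scale better on large dataframes.
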
Psidom

I recently used the pad_sequences function from Keras to do something similar. I'm not sure of your use case, so this might be an unnecessarily large dependency to add.

Anyway, here's the link to the documentation for the function: https://keras.io/preprocessing/sequence/#pad_sequences

from keras.preprocessing.sequence import pad_sequences

input_sequence = [[1, 2, 3], [1, 2], [1, 4]]

# Pad/truncate at the end ('post') to a fixed length of 3, filling with zeros
padded_sequence = pad_sequences(input_sequence, maxlen=3, padding='post', truncating='post', value=0.0)

print(padded_sequence)

The output:

[[1 2 3]
 [1 2 0]
 [1 4 0]]
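
Since pad_sequences operates on plain Python lists rather than a Spark column, you would first need to bring the data to the driver. A minimal sketch, only viable when the column fits in driver memory (the column name feature_indices is taken from the question):

# Collect the array column to the driver, then pad/truncate locally with Keras
rows = df.select('feature_indices').rdd.map(lambda r: r[0]).collect()
padded = pad_sequences(rows, maxlen=5, padding='post', truncating='post', value=0)
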
Nanda