I have the following column in a PySpark dataframe, of type array<int>:

+--------------------+
|     feature_indices|
+--------------------+
|                 [0]|
|[0, 1, 4, 10, 11,...|
|           [0, 1, 2]|
|                 [1]|
|                 [0]|
+--------------------+

I am trying to pad each array with zeros and then cap its length, so that every row's array has the same length. For example, for n = 5, I expect:

+--------------------+
|     feature_indices|
+--------------------+
|     [0, 0, 0, 0, 0]|
|   [0, 1, 4, 10, 11]|
|     [0, 1, 2, 0, 0]|
|     [1, 0, 0, 0, 0]|
|     [0, 0, 0, 0, 0]|
+--------------------+

Any suggestions? I looked at the PySpark rpad function, but it only operates on string-type columns.

dportman

2 Answers

You can write a UDF to do this:

from pyspark.sql.types import ArrayType, IntegerType
import pyspark.sql.functions as F

# Truncate to at most 5 elements, then right-pad with zeros to length 5
pad_fix_length = F.udf(
    lambda arr: arr[:5] + [0] * (5 - len(arr[:5])),
    ArrayType(IntegerType())
)

df.withColumn('feature_indices', pad_fix_length(df.feature_indices)).show()
+-----------------+
|  feature_indices|
+-----------------+
|  [0, 0, 0, 0, 0]|
|[0, 1, 4, 10, 11]|
|  [0, 1, 2, 0, 0]|
|  [1, 0, 0, 0, 0]|
|  [0, 0, 0, 0, 0]|
+-----------------+
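
If you are on Spark 2.4 or later, the same result can be obtained without a UDF, using the built-in array functions concat, array_repeat and slice. A minimal sketch, assuming the column is named feature_indices and n = 5 as in the question:

import pyspark.sql.functions as F

n = 5
padded = F.slice(
    F.concat(F.col('feature_indices'), F.array_repeat(F.lit(0), n)),
    1, n  # slice is 1-indexed: keep the first n elements
)
df.withColumn('feature_indices', padded).show()

Because this stays inside the JVM, it avoids the Python serialization overhead of a UDF and tends to scale better on large dataframes.
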
Psidom

I recently used the pad_sequences function from Keras to do something similar. I'm not sure of your use case, so this might be an unnecessarily large dependency to add.

Anyway, here's the link to the documentation for the function: https://keras.io/preprocessing/sequence/#pad_sequences

from keras.preprocessing.sequence import pad_sequences

input_sequence = [[1, 2, 3], [1, 2], [1, 4]]

# Pad/truncate at the end ('post') to a fixed length of 3, filling with zeros
padded_sequence = pad_sequences(input_sequence, maxlen=3, padding='post', truncating='post', value=0.0)

print(padded_sequence)

The output:

[[1 2 3]
 [1 2 0]
 [1 4 0]]
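
Since pad_sequences operates on plain Python lists rather than a Spark column, you would first need to bring the data to the driver. A minimal sketch, only viable when the column fits in driver memory (the column name feature_indices is taken from the question):

# Collect the array column to the driver, then pad/truncate locally with Keras
rows = df.select('feature_indices').rdd.map(lambda r: r[0]).collect()
padded = pad_sequences(rows, maxlen=5, padding='post', truncating='post', value=0)
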
Nanda