
Let's say I have a DataFrame with an array column:

 # Output
 #+-----------------+
 #|         arrayCol|
 #+-----------------+
 #| [1, 2, 3, 4, 5] |
 #+-----------------+

I want to know if it is possible to split this column into smaller chunks of at most max_size elements without using a UDF.

The desired result with max_size = 2 is the following:

 # Output
 #+-----------------------+
 #|               arrayCol|
 #+-----------------------+
 #| [[1, 2], [3, 4], [5]] |
 #+-----------------------+
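For reference, a minimal DataFrame matching this example can be built as follows (a sketch assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row with a single array column, as shown above
df = spark.createDataFrame([([1, 2, 3, 4, 5],)], ["arrayCol"])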

1 Answer


Another way is to combine transform and filter with an if: mod decides where each chunk starts, and slice (which extracts a sub-array) takes the chunk.

from pyspark.sql import functions as F

n = 2  # max_size

# At every index i that starts a chunk (i % n == 0), slice the next n elements
# out of arrayCol; at every other index produce null, then filter the nulls away.
df.withColumn("NewCol", F.expr(f"""
    filter(
        transform(arrayCol, (x, i) -> if(i % {n} = 0, slice(arrayCol, i + 1, {n}), null)),
        x -> x is not null)
""")).show(truncate=False)


+---------------+---------------------+
|arrayCol       |NewCol               |
+---------------+---------------------+
|[1, 2, 3, 4, 5]|[[1, 2], [3, 4], [5]]|
+---------------+---------------------+
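An equivalent formulation (a sketch of my own, not from the original answer) avoids the null placeholders entirely: sequence generates the 1-based chunk start positions, and slice takes up to n elements from each one:

from pyspark.sql import functions as F

n = 2  # max_size

# sequence(1, size(arrayCol), n) yields the chunk starts: 1, 3, 5, ...
# slice(arrayCol, s, n) then takes at most n elements from each start.
df.withColumn("NewCol", F.expr(f"""
    transform(sequence(1, size(arrayCol), {n}), s -> slice(arrayCol, s, {n}))
""")).show(truncate=False)

Both versions rely only on built-in higher-order functions (available since Spark 2.4), so no UDF is needed.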