
I am trying to get the last n elements of each array in a column named Foo and put them in a separate column called last_n_items_of_Foo. The arrays in Foo have variable length.

I have looked at this article, but the method it shows cannot be used to access the last elements.

import pandas as pd
from pyspark.sql.functions import udf, size
from pyspark.sql.types import StringType
from pyspark.sql.functions import col

df = pd.DataFrame([[[1,1,2,3],1,0],[[1,1,2,7,8,9],0,0],[[1,1,2,3,4,5,8],1,1]],columns = ['Foo','Bar','Baz'])


spark_df = spark.createDataFrame(df)

Here is how output should look

if n=2

                Foo        Bar  Baz   last_2_items_of_Foo  
0           [1, 1, 2, 3]    1    0      [2, 3]
1     [1, 1, 2, 7, 8, 9]    0    0      [8, 9] 
2  [1, 1, 2, 3, 4, 5, 8]    1    1      [5, 8]

1 Answer

You can write your own UDF to get the last n elements of an array:

import pyspark.sql.functions as f
import pyspark.sql.types as t

def get_last_n_elements_(arr, n):
    return arr[-n:]

get_last_n_elements = f.udf(get_last_n_elements_, t.ArrayType(t.IntegerType()))

A UDF takes columns as arguments, so pass the literal n as a column with f.lit(n):

spark_df.withColumn('last_2_items_of_Foo', get_last_n_elements('Foo', f.lit(2))).show()
+--------------------+---+---+-------------------+
|                 Foo|Bar|Baz|last_2_items_of_Foo|
+--------------------+---+---+-------------------+
|        [1, 1, 2, 3]|  1|  0|             [2, 3]|
|  [1, 1, 2, 7, 8, 9]|  0|  0|             [8, 9]|
|[1, 1, 2, 3, 4, 5...|  1|  1|             [5, 8]|
+--------------------+---+---+-------------------+

In Spark 2.4+, there is a built-in function f.slice that can slice an array, so no UDF is needed. I don't currently have a 2.4+ version on my system, but it would look like this (slice takes a start position and a length, and a negative start counts from the end of the array):

spark_df.withColumn('last_2_items_of_Foo', f.slice('Foo', -2, 2)).show()
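
Since I can't run 2.4 here, a minimal pure-Python sketch of the slicing semantics both approaches rely on (the helper name is my own; it just mirrors what the UDF does, including the n = 0 edge case):

```python
def last_n_elements(arr, n):
    # Python's negative slicing returns the last n items.
    # Guard n <= 0 explicitly, because arr[-0:] would return the whole list.
    # If n exceeds len(arr), the whole list is returned, matching the UDF.
    return arr[-n:] if n > 0 else []

rows = [[1, 1, 2, 3], [1, 1, 2, 7, 8, 9], [1, 1, 2, 3, 4, 5, 8]]
print([last_n_elements(r, 2) for r in rows])
# [[2, 3], [8, 9], [5, 8]]
```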

SMaZ