
Why does the column 1st_from_end contain null?

from pyspark.sql.functions import split
df = sqlContext.createDataFrame([('a b c d',)], ['s',])
df.select(   split(df.s, ' ')[0].alias('0th'),
             split(df.s, ' ')[3].alias('3rd'),
             split(df.s, ' ')[-1].alias('1st_from_end')
         ).show()

(The output shows 0th = a and 3rd = d, but 1st_from_end = null.)
I thought using `[-1]` was a Pythonic way to get the last item in a list. Why doesn't it work in PySpark?

jamiet

6 Answers


For Spark 2.4+, use pyspark.sql.functions.element_at; from the documentation:

element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.

from pyspark.sql.functions import element_at, split, col

df = spark.createDataFrame([('a b c d',)], ['s',])

df.withColumn('arr', split(df.s, ' ')) \
  .select( col('arr')[0].alias('0th')
         , col('arr')[3].alias('3rd')
         , element_at(col('arr'), -1).alias('1st_from_end')
     ).show()

+---+---+------------+
|0th|3rd|1st_from_end|
+---+---+------------+
|  a|  d|           d|
+---+---+------------+
jxc

If you're using Spark >= 2.4.0, see jxc's answer above.

In Spark < 2.4.0, the DataFrame API does not support -1 indexing on arrays, but you can write your own UDF or use the built-in size() function, for example:

>>> from pyspark.sql.functions import size
>>> splitted = df.select(split(df.s, ' ').alias('arr'))
>>> splitted.select(splitted.arr[size(splitted.arr)-1]).show()
+--------------------+
|arr[(size(arr) - 1)]|
+--------------------+
|                   d|
+--------------------+
Mariusz
    Thanks for confirming my suspicions. My solution was a bit more hokey than that: `reverse(split(reverse(df.s), ' ')[0])` – jamiet Nov 07 '16 at 20:18

You can also use the getItem method, which allows you to get the i-th item of an ArrayType column. Here's how I would do it:

from pyspark.sql.functions import split, col, size

df.withColumn("Splits", split(col("s"), " ")) \
    .withColumn("0th", col("Splits").getItem(0)) \
    .withColumn("3rd", col("Splits").getItem(3)) \
    .withColumn("1st_from_end", col("Splits").getItem(size(col("Splits"))-1)) \
    .drop("Splits")
Moein

Building on jamiet's solution, we can simplify even further by removing one of the reverse calls: reverse the array once and take its first element (note that reverse on array columns requires Spark 2.4+):

from pyspark.sql.functions import split, reverse

df = sqlContext.createDataFrame([('a b c d',)], ['s',])
df.select(   split(df.s, ' ')[0].alias('0th'),
             split(df.s, ' ')[3].alias('3rd'),
             reverse(split(df.s, ' '))[0].alias('1st_from_end')
         ).show()
Matthew Cox

Here is a small trick using column expressions. It's quite neat because no UDF is involved, and the functional interface makes it quite likable for me.


from pyspark.sql.session import SparkSession
from pyspark.sql.types import StringType, StructField, StructType
from pyspark.sql.functions import split, size
from pyspark.sql import Column
spark = SparkSession.builder.getOrCreate()


data = [
    ('filename', 's3:/hello/no.csv'),
    ('filename', 's3:/hello/why.csv')
]
schema = StructType([
    StructField('name', StringType(), True),
    StructField('path', StringType(), True)
])
df = spark.createDataFrame(data, schema=schema)



def expression_last_item_of_array(split_column: str, split_delimiter: str) -> "tuple[Column, Column]":
    """
    Given a column name and delimiter, return expressions for splitting
    the string and taking the last item of the resulting array.

    Args:
        split_column: str
        split_delimiter: str

    Returns:
        (last, n): tuple of pyspark.sql.Column
    """
    expression = split(split_column, split_delimiter)
    n = size(expression)
    last = expression.getItem(n - 1)
    return last, n

last, n = expression_last_item_of_array('path', '/')
df.show()
df.select(last.alias('last_element'), n.alias('n_items')).show()
df.select(last.alias('last_element')).show()

Output:

+--------+-----------------+
|    name|             path|
+--------+-----------------+
|filename| s3:/hello/no.csv|
|filename|s3:/hello/why.csv|
+--------+-----------------+

+------------+-------+
|last_element|n_items|
+------------+-------+
|      no.csv|      3|
|     why.csv|      3|
+------------+-------+

+------------+
|last_element|
+------------+
|      no.csv|
|     why.csv|
+------------+

DivineCoder

Creating your own UDF would look like this:

    from pyspark.sql import functions as F

    def get_last_element(l):
        return l[-1]
    get_last_element_udf = F.udf(get_last_element)

    df.select(get_last_element_udf(split(df.s, ' ')).alias('1st_from_end')).show()