
Why does the column 1st_from_end contain null?

from pyspark.sql.functions import split
df = sqlContext.createDataFrame([('a b c d',)], ['s',])
df.select(   split(df.s, ' ')[0].alias('0th'),
             split(df.s, ' ')[3].alias('3rd'),
             split(df.s, ' ')[-1].alias('1st_from_end')
         ).show()

(The output shows 0th = a and 3rd = d, but 1st_from_end = null.)
I thought using `[-1]` was a Pythonic way to get the last item in a list. Why doesn't it work in PySpark?

jamiet

6 Answers


For Spark 2.4+, use pyspark.sql.functions.element_at; from the documentation:

element_at(array, index) - Returns element of array at given (1-based) index. If index < 0, accesses elements from the last to the first. Returns NULL if the index exceeds the length of the array.

from pyspark.sql.functions import element_at, split, col

df = spark.createDataFrame([('a b c d',)], ['s',])

df.withColumn('arr', split(df.s, ' ')) \
  .select( col('arr')[0].alias('0th')
         , col('arr')[3].alias('3rd')
         , element_at(col('arr'), -1).alias('1st_from_end')
     ).show()

+---+---+------------+
|0th|3rd|1st_from_end|
+---+---+------------+
|  a|  d|           d|
+---+---+------------+
jxc

If you're using Spark >= 2.4.0, see jxc's answer above.

In Spark < 2.4.0, the DataFrame API does not support -1 indexing on arrays, but you can write your own UDF or use the built-in size() function, for example:

>>> from pyspark.sql.functions import size
>>> splitted = df.select(split(df.s, ' ').alias('arr'))
>>> splitted.select(splitted.arr[size(splitted.arr)-1]).show()
+--------------------+
|arr[(size(arr) - 1)]|
+--------------------+
|                   d|
+--------------------+
Mariusz
    Thanks for confirming my suspicions. My solution was a bit more hokey than that: `reverse(split(reverse(df.s), ' ')[0])` – jamiet Nov 07 '16 at 20:18

You can also use the getItem method, which allows you to get the i-th item of an ArrayType column. Here's how I would do it:

from pyspark.sql.functions import split, col, size

df.withColumn("Splits", split(col("s"), " ")) \
    .withColumn("0th", col("Splits").getItem(0)) \
    .withColumn("3rd", col("Splits").getItem(3)) \
    .withColumn("1st_from_end", col("Splits").getItem(size(col("Splits"))-1)) \
    .drop("Splits")
Moein

Building on jamiet's solution, we can simplify even further by removing one of the reverse calls: reverse the array once and take its first element (note that reverse on array columns requires Spark 2.4+):

from pyspark.sql.functions import split, reverse

df = sqlContext.createDataFrame([('a b c d',)], ['s',])
df.select(   split(df.s, ' ')[0].alias('0th'),
             split(df.s, ' ')[3].alias('3rd'),
             reverse(split(df.s, ' '))[0].alias('1st_from_end')
         ).show()
Matthew Cox

Here is a small trick using column expressions. It's quite neat because no UDF is involved, and the functional interface makes it quite likable for me.


from pyspark.sql.session import SparkSession
from pyspark.sql.types import StringType, StructField, StructType
from pyspark.sql.functions import split, size
from pyspark.sql import Column
spark = SparkSession.builder.getOrCreate()


data = [
    ('filename', 's3:/hello/no.csv'),
    ('filename', 's3:/hello/why.csv')
]
schema = StructType([
    StructField('name', StringType(), True),
    StructField('path', StringType(), True)
])
df = spark.createDataFrame(data, schema=schema)



def expression_last_item_of_array(split_column: str, split_delimiter: str) -> "tuple[Column, Column]":
    """
    Given a column name and delimiter, return expressions for splitting
    the string and taking the last item of the resulting array.

    Args:
        split_column: str
        split_delimiter: str

    Returns:
        (last, n): tuple of pyspark.sql.Column
    """
    expression = split(split_column, split_delimiter)
    n = size(expression)
    last = expression.getItem(n - 1)
    return last, n

last, n = expression_last_item_of_array('path', '/')
df.show()
df.select(last.alias('last_element'), n.alias('n_items')).show()
df.select(last.alias('last_element')).show()

Output:

+--------+-----------------+
|    name|             path|
+--------+-----------------+
|filename| s3:/hello/no.csv|
|filename|s3:/hello/why.csv|
+--------+-----------------+

+------------+-------+
|last_element|n_items|
+------------+-------+
|      no.csv|      3|
|     why.csv|      3|
+------------+-------+

+------------+
|last_element|
+------------+
|      no.csv|
|     why.csv|
+------------+

DivineCoder

Creating your own UDF would look like this:

    from pyspark.sql import functions as F

    def get_last_element(l):
        return l[-1]
    get_last_element_udf = F.udf(get_last_element)

    df.select(get_last_element_udf(split(df.s, ' ')).alias('1st_from_end')).show()