Pyspark - Split a column and take n elements

Question

I want to take a column and split a string using a character. As per usual, I understood that the method split would return a list, but when coding I found that the returning object had only the methods getItem or getField with the following descriptions from the API:

@since(1.3)   
def getItem(self, key):
    """
    An expression that gets an item at position ``ordinal`` out of a list,
    or gets an item by key out of a dict.


@since(1.3)
def getField(self, name):
    """
    An expression that gets a field by name in a StructField.

Obviously this doesnt meet my requirements, for example for the text within the column "A_B_C_D" I would like to split between "A_B_C_" and "D" in two different columns.

This is the code I'm using

from pyspark.sql.functions import regexp_extract, col, split
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data

split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(3))

Find an example:

from pyspark.sql import Row
from pyspark.sql.functions import regexp_extract, col, split
l = [("Item1_Item2_ItemN"),("FirstItem_SecondItem_LastItem"),("ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn")]
rdd = sc.parallelize(l)
datax = rdd.map(lambda x: Row(fullString=x))
df = sqlContext.createDataFrame(datax)
split_col=split(df['fullString'],'_')
df=df.withColumn('LastItemOfSplit',split_col.getItem(2))

Result:

fullString                                                LastItemOfSplit
Item1_Item2_ItemN                                            ItemN
FirstItem_SecondItem_LastItem                                LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn     null

My expected result would be having always the last item

fullString                                                LastItemOfSplit
Item1_Item2_ItemN                                            ItemN
FirstItem_SecondItem_LastItem                                LastItem
ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn  ThisShouldBeInTheLastColumn

You can pass in a regexp pattern to `split`. Perhaps `split(df_test["Full_text"], r"_(?=.$)")` would work. — pault, Mar 13 '19 at 13:35
@Psidom I would like another way to access the results of the split, not by the index number, like the last element of the resultant list (the size varies within the data). — Alejandro A, Mar 13 '19 at 13:39

score 6 · Accepted Answer · answered Mar 13 '19 at 13:51

You can use getItem(size - 1) to get the last item from the arrays:

Example:

df = spark.createDataFrame([[['A', 'B', 'C', 'D']], [['E', 'F']]], ['split'])
df.show()
+------------+
|       split|
+------------+
|[A, B, C, D]|
|      [E, F]|
+------------+

import pyspark.sql.functions as F
df.withColumn('lastItem', df.split.getItem(F.size(df.split) - 1)).show()
+------------+--------+
|       split|lastItem|
+------------+--------+
|[A, B, C, D]|       D|
|      [E, F]|       F|
+------------+--------+

For your case:

from pyspark.sql.functions import regexp_extract, col, split, size
df_test=spark.sql("SELECT * FROM db_test.table_test")
#Applying the transformations to the data

split_col=split(df_test['Full_text'],'_')
df_split=df_test.withColumn('Last_Item',split_col.getItem(size(split_col) - 1))

In newer versions of pyspark 'element_at' should be used instead of getItems(size( as jxc states in his answer here: https://stackoverflow.com/questions/40467936/how-do-i-get-the-last-item-from-a-list-using-pyspark — the_economist, Aug 28 '23 at 07:17

score 1 · Answer 2 · answered Mar 13 '19 at 14:07

You can pass in a regular expression pattern to split.

The following would work for your example:

from pyspark.sql.functions split

split_col=split(df['fullString'], r"_(?=.+$)")
df = df.withColumn('LastItemOfSplit', split_col.getItem(1))
df.show(truncate=False)
#+--------------------------------------------------------+---------------------------+
#|fullString                                              |LastItemOfSplit            |
#+--------------------------------------------------------+---------------------------+
#|Item1_Item2_ItemN                                       |Item2                      |
#|FirstItem_SecondItem_LastItem                           |SecondItem                 |
#|ThisShouldBeInTheFirstColumn_ThisShouldBeInTheLastColumn|ThisShouldBeInTheLastColumn|
#+--------------------------------------------------------+---------------------------+

The pattern means the following:

_ the literal underscore
(?=.+$) positive look-ahead for anything (.) until the end of the string $

This will split the string on the last underscore. Then call .getItem(1) to get the item at index 1 in the resultant list.

I got a similar problem where I'd like to split on the last '/'. `r"/(?=.+$)"` does not do the job :/. Can you please help me out? — the_economist, Aug 28 '23 at 07:11

Pyspark - Split a column and take n elements

2 Answers2

Linked

Related