I have a DataFrame in the following format:
+-------------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------------+
|datalake-performance/raw/bamboohr/bamboohr_custom_turnover_data/2020/12/10/11:15.csv |
|datalake-performance/raw/gitlab/002429d9-908c-497b-96ba-67794b31f0cd |
|datalake-performance/processed/bamboohr/employee/04-08-2020/16:23.csv |
|datalake-performance/raw/zoom/user/year=2020/month=09/day=22/a329affc-b1f5-45d1-932a-fbb13d9873d6|
+-------------------------------------------------------------------------------------------------+
I want to generate a new column in the following format:
newcol
[bamboohr_custom_turnover_data, 2020, 12, 10]
[]
[employee, 04-08-2020]
[user, year=2020, month=09, day=22]
With Pandas, this looks like:
df['value'].str.split('/').str[3:-1]
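For context, here is a minimal runnable Pandas version using two of the sample rows above (the DataFrame construction is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"value": [
    "datalake-performance/raw/bamboohr/bamboohr_custom_turnover_data/2020/12/10/11:15.csv",
    "datalake-performance/raw/gitlab/002429d9-908c-497b-96ba-67794b31f0cd",
]})

# split each path on "/" and keep everything except the first 3
# and the last 1 components
df["newcol"] = df["value"].str.split("/").str[3:-1]
```

This yields `['bamboohr_custom_turnover_data', '2020', '12', '10']` for the first row and `[]` for the second, matching the desired output.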
I have tried the below with PySpark, but I receive an error:
df = df.withColumn("list", (split(col("value"), "/")))
df.select(slice(df["list"], 3, size(df["list"]) - (3 + 1)))
TypeError: Column is not iterable
How do I get the equivalent of the `[3:-1]` slice in PySpark?