I have a DataFrame in the following format:
+-------------------------------------------------------------------------------------------------+
|value |
+-------------------------------------------------------------------------------------------------+
|datalake-performance/raw/bamboohr/bamboohr_custom_turnover_data/2020/12/10/11:15.csv |
|datalake-performance/raw/gitlab/002429d9-908c-497b-96ba-67794b31f0cd |
|datalake-performance/processed/bamboohr/employee/04-08-2020/16:23.csv |
|datalake-performance/raw/zoom/user/year=2020/month=09/day=22/a329affc-b1f5-45d1-932a-fbb13d9873d6|
+-------------------------------------------------------------------------------------------------+
I want to generate a new column in the following format:
newcol
[bamboohr_custom_turnover_data, 2020, 12, 10]
[]
[employee, 04-08-2020]
[user, year=2020, month=09, day=22]
With Pandas, this looks like:
df['value'].str.split('/').str[3:-1]
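For context, here is a minimal runnable Pandas version using two of the sample rows above (the DataFrame construction is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({"value": [
    "datalake-performance/raw/bamboohr/bamboohr_custom_turnover_data/2020/12/10/11:15.csv",
    "datalake-performance/raw/gitlab/002429d9-908c-497b-96ba-67794b31f0cd",
]})

# split each path on "/" and keep everything except the first 3
# and the last 1 components
df["newcol"] = df["value"].str.split("/").str[3:-1]
```

This yields `['bamboohr_custom_turnover_data', '2020', '12', '10']` for the first row and `[]` for the second, matching the desired output.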
I have tried the below with PySpark, but I receive an error:
df = df.withColumn("list", (split(col("value"), "/")))
df.select(slice(df["list"], 3, size(df["list"]) - (3 + 1)))
TypeError: Column is not iterable
How do I get the equivalent of the `[3:-1]` slice in PySpark?