Extracting a sub array from PySpark DataFrame column

Question

I wish to remove the last element of the array in each row of this DataFrame. We have this link demonstrating the same thing, but it uses UDFs, which I wish to avoid. Is there a simple way to do this, something like list[:2]?

data = [(['cat','dog','sheep'],),(['bus','truck','car'],),(['ice','pizza','pasta'],)]
df = sqlContext.createDataFrame(data,['data'])
df.show()
+-------------------+
|               data|
+-------------------+
|  [cat, dog, sheep]|
|  [bus, truck, car]|
|[ice, pizza, pasta]|
+-------------------+

Expected DataFrame:

+--------------+
|          data|
+--------------+
|    [cat, dog]|
|  [bus, truck]|
|  [ice, pizza]|
+--------------+

Are all the lists of the same size? Do you know that length ahead of time? — pault, Dec 17 '18 at 15:42
Yeah, they were all of size 3. If you have any method to achieve the result avoiding a `UDF`, kindly pen it down. Many thanks! — cph_sto, Dec 17 '18 at 15:47

score 2 · Answer 1 · answered Dec 17 '18 at 09:39

A UDF is the best thing you can find for this in PySpark :)

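from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

# Get the first two elements. Declare the return type so the result stays an
# array of strings (without it, udf defaults to StringType).
split_row = udf(lambda row: row[:2], ArrayType(StringType()))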

# apply the udf to each row
new_df = df.withColumn("data", split_row(df["data"]))

new_df.show()
# Output

+------------+
|        data|
+------------+
|  [cat, dog]|
|[bus, truck]|
|[ice, pizza]|
+------------+


LaSul


I know how to do it with a `UDF`, but wanted to know how to do it without one. `UDF`s cause immense overhead because of serialization when the DataFrame is very big, which is why I wanted to avoid them. Thanks for your efforts, much appreciated :) – cph_sto Dec 17 '18 at 09:43
There is nothing better than UDF if you want to work on big loads and apply current operations you can't usually do ;) – LaSul Dec 17 '18 at 09:56
Hi, If you check the execution plan, you can see the difference, especially on big loads. BTW, I haven't marked this answer negative. – cph_sto Dec 17 '18 at 09:58
Yup I know that changes the execution plan. If it doesn't change, it is much much slower. I don't see any "easy" way without UDF to do it tho – LaSul Dec 17 '18 at 10:02
Yes, that's a fair comment. So, I suppose there is none. Though it doesn't answer my question, I will upvote it, as at this time there seems to be no better solution on the horizon. Many many thanks Sir. – cph_sto Dec 17 '18 at 10:04
