
I have a dataframe with a text column that looks like this:

product 
"HI Celebrate Cake White    612612" 
"GO Choc Celebrat bl    682222" 
"BI Chocolate Buttercream f34   9282222" 
"Graduation Cake    28" 
"Slab Image Cake  13273" 
"Slab Celebration Cake  2556659" 
"Grain Bread" 

I want to be able to strip everything after the third word. How do I go about doing this with pyspark or spark sql?

pault
mblume
  • Possible duplicate of https://stackoverflow.com/questions/53153149/remove-last-few-characters-in-pyspark-dataframe-column/55353098#55353098 – Shantanu Sharma Mar 30 '19 at 18:38

2 Answers


You can use a regex to extract the first three words: ([^\s]+\s+){0,2} matches up to two words each followed by whitespace, and the final [^\s]+ matches one more word.

from pyspark.sql.functions import regexp_extract, col

df.select(regexp_extract(col("product"), "([^\\s]+\\s+){0,2}[^\\s]+", 0))\
  .show(truncate=False)
+--------------------------------------------------+
|regexp_extract(product, ([^\s]+\s+){0,2}[^\s]+, 0)|
+--------------------------------------------------+
|HI Celebrate Cake                                 |
|GO Choc Celebrat                                  |
|BI Chocolate Buttercream                          |
|Graduation Cake    28                             |
|Slab Image Cake                                   |
|Slab Celebration Cake                             |
|Grain Bread                                       |
+--------------------------------------------------+
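Since regexp_extract uses Java-style regular expressions that behave the same way here as Python's, the pattern can be sanity-checked outside Spark with the standard re module; the sample rows below are taken from the question:

```python
import re

# Same pattern as in the answer: up to two "word + whitespace" groups,
# followed by one final word, i.e. at most the first three words.
pattern = re.compile(r"([^\s]+\s+){0,2}[^\s]+")

rows = [
    "HI Celebrate Cake White    612612",
    "Graduation Cake    28",
    "Grain Bread",
]
for r in rows:
    print(pattern.match(r).group(0))
```

Note that the match is greedy, so for "Graduation Cake    28" the third "word" is the number 28, matching the Spark output above.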
Oli

I found the solution:

from pyspark.sql import functions as sf

df_test = spark.sql("select * from brand_cleanup")

# Split on runs of whitespace (the pattern argument is a regex); splitting on
# a single space ' ' would produce empty tokens wherever the description
# contains consecutive spaces.
split_col = sf.split(df_test.item_eng_desc, r'\s+')
# concat_ws skips nulls, so rows with fewer than three words survive intact
# (plain concat would return null for them).
df_split = df_test.withColumn(
    'item_desc_clean',
    sf.concat_ws(' ', split_col.getItem(0), split_col.getItem(1), split_col.getItem(2))
)
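The split-and-recombine logic can also be sketched in plain Python, with no Spark session, to check the expected output; the function name is illustrative:

```python
# str.split() with no argument splits on runs of whitespace, so consecutive
# spaces do not produce empty tokens; [:3] keeps at most three words even for
# short strings like "Grain Bread".
def first_three_words(text):
    return ' '.join(text.split()[:3])

print(first_three_words("HI Celebrate Cake White    612612"))  # HI Celebrate Cake
print(first_three_words("Grain Bread"))                        # Grain Bread
```

One difference from the regex answer: the rejoined string is always single-spaced, so "Graduation Cake    28" becomes "Graduation Cake 28".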
mblume
  • 243
  • 1
  • 3
  • 11