
I have a dataframe with a text column that looks like this:

product 
"HI Celebrate Cake White    612612" 
"GO Choc Celebrat bl    682222" 
"BI Chocolate Buttercream f34   9282222" 
"Graduation Cake    28" 
"Slab Image Cake  13273" 
"Slab Celebration Cake  2556659" 
"Grain Bread" 

I want to be able to strip everything after the third word. How do I go about doing this with pyspark or spark sql?

pault
mblume
  • Possible duplicate of https://stackoverflow.com/questions/53153149/remove-last-few-characters-in-pyspark-dataframe-column/55353098#55353098 – Shantanu Sharma Mar 30 '19 at 18:38

2 Answers


You can use a regex to extract the first three words: ([^\s]+\s+){0,2} matches up to two words each followed by whitespace, and the final [^\s]+ matches one more word.

from pyspark.sql.functions import regexp_extract, col

df.select(regexp_extract(col("product"), "([^\\s]+\\s+){0,2}[^\\s]+", 0))\
  .show(truncate=False)
+--------------------------------------------------+
|regexp_extract(product, ([^\s]+\s+){0,2}[^\s]+, 0)|
+--------------------------------------------------+
|HI Celebrate Cake                                 |
|GO Choc Celebrat                                  |
|BI Chocolate Buttercream                          |
|Graduation Cake    28                             |
|Slab Image Cake                                   |
|Slab Celebration Cake                             |
|Grain Bread                                       |
+--------------------------------------------------+
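Since regexp_extract uses Java-style regular expressions that behave the same way here as Python's, the pattern can be sanity-checked outside Spark with the standard re module; the sample rows below are taken from the question:

```python
import re

# Same pattern as in the answer: up to two "word + whitespace" groups,
# followed by one final word, i.e. at most the first three words.
pattern = re.compile(r"([^\s]+\s+){0,2}[^\s]+")

rows = [
    "HI Celebrate Cake White    612612",
    "Graduation Cake    28",
    "Grain Bread",
]
for r in rows:
    print(pattern.match(r).group(0))
```

Note that the match is greedy, so for "Graduation Cake    28" the third "word" is the number 28, matching the Spark output above.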
Oli

I found the solution:

from pyspark.sql import functions as sf

df_test = spark.sql("select * from brand_cleanup")

# Split on runs of whitespace (the pattern argument is a regex); splitting on
# a single space ' ' would produce empty tokens wherever the description
# contains consecutive spaces.
split_col = sf.split(df_test.item_eng_desc, r'\s+')
# concat_ws skips nulls, so rows with fewer than three words survive intact
# (plain concat would return null for them).
df_split = df_test.withColumn(
    'item_desc_clean',
    sf.concat_ws(' ', split_col.getItem(0), split_col.getItem(1), split_col.getItem(2))
)
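The split-and-recombine logic can also be sketched in plain Python, with no Spark session, to check the expected output; the function name is illustrative:

```python
# str.split() with no argument splits on runs of whitespace, so consecutive
# spaces do not produce empty tokens; [:3] keeps at most three words even for
# short strings like "Grain Bread".
def first_three_words(text):
    return ' '.join(text.split()[:3])

print(first_three_words("HI Celebrate Cake White    612612"))  # HI Celebrate Cake
print(first_three_words("Grain Bread"))                        # Grain Bread
```

One difference from the regex answer: the rejoined string is always single-spaced, so "Graduation Cake    28" becomes "Graduation Cake 28".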
mblume
  • 243
  • 1
  • 3
  • 11