Is there a way to extract specific text from another string in python?

Question

I have a product description column containing a description of each product, e.g. "Kanchivaram saree of red colour". I want to extract just the product type, i.e. "saree" in this case. The following is an example of the table:

product_description	product_type
kanchivaram saree of red colour	saree
Pink gujrati saree	saree
Lehenga from Surat	lehenga
Red swim suit	swim suit

Is there an algorithm or some other way I can do that in Python?

sure... but if you already have the product_type for each description why do you need to extract it? — Alexander, Aug 16 '22 at 07:23

Divyank · Answer 1 · 2022-08-16T07:47:12.967

Suppose you have a DataFrame df as below.

df

    product_description
0   kanchivaram saree of red colour
1   Pink gujrati saree
2   Lehenga from Surat
3   Red swim suit

You will need a list of the words that can appear in the product_type column, as below.

lst= ['saree','Lehenga','swim suit']

Each entry in lst is then searched for in every row of the product_description column to create the product_type column, as in the code below. Using regex: efficient for big DataFrames, and also case-insensitive.

import re
import pandas as pd
# initialize data of lists.
data = {'product_description': ['kanchivaram saree of red colour', 'Pink gujrati saree', 'Lehenga from Surat', 'Red swim suit'],}
# Create DataFrame
df = pd.DataFrame(data)
lst = ['saree','Lehenga','swim suit']
# Build one alternation pattern from the escaped words; IGNORECASE makes the match case-insensitive.
regex = re.compile(fr"\s*({'|'.join(re.escape(x) for x in lst)})", re.IGNORECASE)
# expand=False returns a Series holding the first match in each row (NaN if none).
df['product_type_using_regex'] = df['product_description'].str.extract(regex, expand=False)
df

Alternate method (case-sensitive), complete code:

import numpy as np
import pandas as pd
# initialize data of lists.
data = {'product_description': ['kanchivaram saree of red colour', 'Pink gujrati saree', 'Lehenga from Surat', 'Red swim suit'],} 
# Create DataFrame
df = pd.DataFrame(data)
lst = ['saree','Lehenga','swim suit']
df['product_type'] = df['product_description'].apply(lambda x: ';'.join([m for m in lst if m in x])).replace('',np.nan)
df

Output:

    product_description              product_type
0   kanchivaram saree of red colour  saree
1   Pink gujrati saree               saree
2   Lehenga from Surat               Lehenga
3   Red swim suit                    swim suit
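
The question's expected table shows the type in lower case ("lehenga"), while the list above yields "Lehenga". Here is a minimal sketch of one way to normalise the result. It assumes a hypothetical list named product_types, spelled the way the output should read. The word boundaries (\b) and the longest-first ordering are additions for illustration and are not part of the code above.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "product_description": [
        "kanchivaram saree of red colour",
        "Pink gujrati saree",
        "Lehenga from Surat",
        "Red swim suit",
    ]
})

# Hypothetical list of known product types, written as they should appear in the output.
product_types = ["saree", "lehenga", "swim suit"]

# Try longer names first, so a longer type is attempted before a shorter one
# that starts at the same position (e.g. "swim suit" before a hypothetical "swim").
alternatives = sorted(product_types, key=len, reverse=True)

# \b keeps a type from matching inside a longer word.
pattern = r"\b(" + "|".join(re.escape(t) for t in alternatives) + r")\b"

df["product_type"] = (
    df["product_description"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)  # first match per row
    .str.lower()                                              # normalise the case
)
print(df)
```

Descriptions that contain none of the listed types come out as NaN in product_type.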

Looping and using `in` is certainly an inefficient way to do this, performance will be terrible on large inputs. — mozway, Aug 16 '22 at 07:28
done updated code using regex & thanks `regex = re.compile(fr"\s*({'|'.join(re.escape(x) for x in lst)})", re.IGNORECASE) df['product_type_using_regex'] = df['product_description'].str.extract(regex, '')` — Divyank, Aug 16 '22 at 07:48
