I have a column, product_description, containing a free-text description of each product, e.g. "Kanchivaram saree of red colour". I want to extract just the product type, which is "saree" in this case. The following is an example of the table:

product_description              product_type
kanchivaram saree of red colour  saree
Pink gujrati saree               saree
Lehenga from Surat               lehenga
Red swim suit                    swim suit

Is there an algorithm or another way to do this in Python?

Aditya

1 Answer

Suppose you have a DataFrame df as below.

df

    product_description
0   kanchivaram saree of red colour
1   Pink gujrati saree
2   Lehenga from Surat
3   Red swim suit

You will need a list of the product types that can appear in the product_type column, as below.

lst= ['saree','Lehenga','swim suit']

The code below builds a regex from lst and applies it to each row of product_description to create the product type column. A regex scales better to large DataFrames than a Python-level loop, and the match here is case insensitive.

import re
import pandas as pd
# Initialize the data as a dict of lists.
data = {'product_description': ['kanchivaram saree of red colour', 'Pink gujrati saree', 'Lehenga from Surat', 'Red swim suit']}
# Create DataFrame
df = pd.DataFrame(data)
lst = ['saree','Lehenga','swim suit']
# One capture group holding an alternation of the escaped product types
pattern = fr"({'|'.join(re.escape(x) for x in lst)})"
# expand=False returns a Series; the extracted text keeps the casing found in the description
df['product_type_using_regex'] = df['product_description'].str.extract(pattern, flags=re.IGNORECASE, expand=False)
df
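The pattern above matches a product type anywhere in the text, including inside a longer word, and when two list entries start at the same position it picks whichever comes first in the list. Below is a minimal sketch of a stricter variant, assuming you want whole-word matches and the lowercase labels shown in the question's table. The helper name build_type_pattern and the extra 'swim' entry are hypothetical, added only for illustration.

```python
import re

import pandas as pd


def build_type_pattern(product_types):
    """Build a whole-word alternation, longest entries first (hypothetical helper)."""
    # Longest first so 'swim suit' wins over a shorter entry that starts
    # at the same position, such as 'swim'.
    ordered = sorted(product_types, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    # \b stops 'swim' from matching inside a longer word like 'swimming'.
    return rf"\b({alternation})\b"


df = pd.DataFrame({
    "product_description": [
        "kanchivaram saree of red colour",
        "Pink gujrati saree",
        "Lehenga from Surat",
        "Red swim suit",
    ]
})

pattern = build_type_pattern(["saree", "Lehenga", "swim", "swim suit"])
df["product_type"] = (
    df["product_description"]
    .str.extract(pattern, flags=re.IGNORECASE, expand=False)
    .str.lower()  # normalise to the lowercase labels used in the question
)
```

Rows with no match stay NaN, so they are easy to find afterwards with df['product_type'].isna().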

Alternate method (case sensitive), complete code:

import numpy as np
import pandas as pd
# Initialize the data as a dict of lists.
data = {'product_description': ['kanchivaram saree of red colour', 'Pink gujrati saree', 'Lehenga from Surat', 'Red swim suit']}
# Create DataFrame
df = pd.DataFrame(data)
lst = ['saree','Lehenga','swim suit']
# Join every list entry found in the description with ';'; rows with no match become NaN
df['product_type'] = df['product_description'].apply(lambda x: ';'.join([m for m in lst if m in x])).replace('', np.nan)
df

Output:

   product_description              product_type
0   kanchivaram saree of red colour saree
1   Pink gujrati saree              saree
2   Lehenga from Surat              Lehenga
3   Red swim suit                   swim suit
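The alternate method keeps every matching type, joined with ';', whereas str.extract returns only the first match. If that behaviour matters, a regex-based sketch along the following lines may work. The two sample descriptions are made up for illustration and are not from the question.

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample rows: one with two product types, one with none.
df = pd.DataFrame({
    "product_description": [
        "Red saree with matching Lehenga",
        "Plain cotton shirt",
    ]
})
lst = ["saree", "Lehenga", "swim suit"]

# No capture group, so findall returns the full text of each match.
pattern = "|".join(re.escape(x) for x in lst)
matches = df["product_description"].str.findall(pattern, flags=re.IGNORECASE)

# Join each row's list of matches with ';'; an empty list becomes '' and then NaN.
df["product_type"] = matches.str.join(";").replace("", np.nan)
```

Unlike the `in` check, this variant is case insensitive and reports matches in the order they appear in the description.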
Divyank
  • Looping and using `in` is certainly an inefficient way to do this, performance will be terrible on large inputs. – mozway Aug 16 '22 at 07:28
  • do let us know any other way to improve performance – Divyank Aug 16 '22 at 07:29
  • You can use a regex for example. See duplicate. – mozway Aug 16 '22 at 07:31
  • done updated code using regex & thanks `regex = re.compile(fr"\s*({'|'.join(re.escape(x) for x in lst)})", re.IGNORECASE) df['product_type_using_regex'] = df['product_description'].str.extract(regex, '')` – Divyank Aug 16 '22 at 07:48