I have two data frames (df1
and df2
).
df1
looks something like this:
data = [['RJ REYNOLDS VAPOR', 'Tobacco Products', 21, '4442fa51-d4b7-4d56-ad7c-d18cfd2aff84', 'Mature'],
['BEYOND MEAT', 'Packaged Meat', 30, 'e73f7957-0e65-4466-9588-795bdc5f67ac', 'Pantry'],
['BEYOND MEAT', 'Plant-Based Meat', 1584, 'd8cb60e5-b0c6-478a-971d-c6c55b17831f', 'Meat & Seafood'],
['BEYOND MEAT', 'Frozen Plant-Based Meat', 313, '8e0a9431-5462-4810-9f65-68fe36adf454', 'Frozen'],
['TARGET', 'Wine', 39, 'ca1c0f4d-3efc-4978-8357-69862996f416', 'Alcohol'],
['TARGET', 'Deodorant & Antiperspirant', 192, 'ca1c0f4d-3efc-4978-8357-69862996f416', 'Health & Wellness'],
['DOVE', 'Bath & Body', 15821, '47fae368-9b64-40be-82c9-898f953e9d66', 'Health & Wellness']]
df1 = pd.DataFrame(data, columns=['BRAND', 'PRODUCT_CATEGORY', 'RECEIPTS', 'CATEGORY_ID', 'PARENT_CATEGORY'])
Looks something like this (the original is 9906 x 5):
BRAND | PRODUCT_CATEGORY | RECEIPTS | CATEGORY_ID | PARENT_CATEGORY |
---|---|---|---|---|
RJ REYNOLDS VAPOR | Tobacco Products | 21 | 4442fa51-d4b7-4d56-ad7c-d18cfd2aff84 | Mature |
BEYOND MEAT | Packaged Meat | 30 | e73f7957-0e65-4466-9588-795bdc5f67ac | Pantry |
BEYOND MEAT | Plant-Based Meat | 1584 | d8cb60e5-b0c6-478a-971d-c6c55b17831f | Meat & Seafod |
df2
looks something like this:
offer_data = [['Beyond Meat® Plant-Based products, spend $25', None, 'BEYOND MEAT'],
['Beyond Steak™ Plant-Based seared tips, 10 ounce at Target', 'TARGET', 'BEYOND MEAT'],
['Beyond Steak™ Plant-Based seared tips, 10 ounce at H-E-B', 'H-E-B', 'BEYOND MEAT'],
['Dove Hand Wash select varieties buy 2 at TARGET', 'TARGET', 'DOVE'],
['Dove Hand Wash select varieties at Target', 'TARGET', 'DOVE']]
df2 = pd.DataFrame(offer_data, columns = ['OFFER', 'RETAILER', 'BRAND'])
df2
Original is 384 x 3
OFFER | RETAILER | BRAND |
---|---|---|
Beyond Meat® Plant-Based products, spend $25 | None | BEYOND MEAT |
Beyond Steak™ Plant-Based seared tips, 10 ounce at Target | TARGET | BEYOND MEAT |
My goal is to add the appropriate 'PRODUCT_CATEGORY', 'PARENT_CATEGORY' and 'CATEGORY_ID' to each offer.
First I removed all non-alphanumeric characters from the 'OFFER" column in df2
df2['OFFER'] = df2.OFFER.str.replace('[^\w\s]', '')
then I extracted keywords from the 'OFFERS' column using the yake library:
import yake
offers = df2['OFFER'].unique()
kw_extractor = yake.KeywordExtractor()
language = 'en'
max_ngram_size = 3
deduplication_threshold = 0.9
numOfKeywords = 6
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)
tags = []
for offer in offers:
keywords = custom_kw_extractor.extract_keywords(offer)
subtags = []
for kw in keywords:
subtags.append(kw[0])
tags.append(subtags)
Which gave me a list of lists of keywords for each offer. I am pretty new to using NLP, so this is where I am getting stuck. I can match the 'BRAND' in df1
to the 'BRAND' in df2
to filter out most of the rows in df1
. Some offers are only available at specific retailers and all offers belong to a 'PRODUCT'/'PARENT' category. My idea is to use the keywords extracted from the offers and find a closely related row match from df1
. In the example data frame above some offers can fit into multiple categories like the 'BEYOND MEAT' categories. What is the best way to iterate through both data frames to find the appropriate 'PRODUCT_CATEGORY'/'PARENT_CATEGORY' for each offer? Or would there be an easier (more effective/accurate?) method to go about this?