
I have two data frames (df1 and df2). df1 looks something like this:

data = [['RJ REYNOLDS VAPOR', 'Tobacco Products', 21, '4442fa51-d4b7-4d56-ad7c-d18cfd2aff84', 'Mature'],
['BEYOND MEAT', 'Packaged Meat', 30, 'e73f7957-0e65-4466-9588-795bdc5f67ac', 'Pantry'],
['BEYOND MEAT', 'Plant-Based Meat', 1584, 'd8cb60e5-b0c6-478a-971d-c6c55b17831f', 'Meat & Seafood'],
['BEYOND MEAT', 'Frozen Plant-Based Meat', 313, '8e0a9431-5462-4810-9f65-68fe36adf454', 'Frozen'],
['TARGET', 'Wine', 39, 'ca1c0f4d-3efc-4978-8357-69862996f416', 'Alcohol'],
['TARGET', 'Deodorant & Antiperspirant', 192, 'ca1c0f4d-3efc-4978-8357-69862996f416', 'Health & Wellness'],
['DOVE', 'Bath & Body', 15821, '47fae368-9b64-40be-82c9-898f953e9d66', 'Health & Wellness']]

df1 = pd.DataFrame(data, columns=['BRAND', 'PRODUCT_CATEGORY', 'RECEIPTS', 'CATEGORY_ID', 'PARENT_CATEGORY'])

Looks something like this (the original is 9906 x 5):

BRAND PRODUCT_CATEGORY RECEIPTS CATEGORY_ID PARENT_CATEGORY
RJ REYNOLDS VAPOR Tobacco Products 21 4442fa51-d4b7-4d56-ad7c-d18cfd2aff84 Mature
BEYOND MEAT Packaged Meat 30 e73f7957-0e65-4466-9588-795bdc5f67ac Pantry
BEYOND MEAT Plant-Based Meat 1584 d8cb60e5-b0c6-478a-971d-c6c55b17831f Meat & Seafood

df2 looks something like this:

offer_data = [['Beyond Meat® Plant-Based products, spend $25', None, 'BEYOND MEAT'],
['Beyond Steak™ Plant-Based seared tips, 10 ounce at Target', 'TARGET', 'BEYOND MEAT'],
['Beyond Steak™ Plant-Based seared tips, 10 ounce at H-E-B', 'H-E-B', 'BEYOND MEAT'],
['Dove Hand Wash select varieties buy 2 at TARGET', 'TARGET', 'DOVE'],
['Dove Hand Wash select varieties at Target', 'TARGET', 'DOVE']]

df2 = pd.DataFrame(offer_data, columns = ['OFFER', 'RETAILER', 'BRAND'])

The original df2 is 384 x 3:

OFFER RETAILER BRAND
Beyond Meat® Plant-Based products, spend $25 None BEYOND MEAT
Beyond Steak™ Plant-Based seared tips, 10 ounce at Target TARGET BEYOND MEAT

My goal is to add the appropriate 'PRODUCT_CATEGORY', 'PARENT_CATEGORY' and 'CATEGORY_ID' to each offer.

First, I removed all non-alphanumeric characters from the 'OFFER' column in df2:

df2['OFFER'] = df2['OFFER'].str.replace(r'[^\w\s]', '', regex=True)
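For reference, here is what that cleaning step does to one of the sample offers (a self-contained sketch; note that `[^\w\s]` also strips hyphens, so 'Plant-Based' becomes 'PlantBased', which may or may not be what you want for keyword matching):

```python
import pandas as pd

s = pd.Series(['Beyond Meat® Plant-Based products, spend $25'])
# raw string + regex=True: ®, the comma, $ and the hyphen are all removed
cleaned = s.str.replace(r'[^\w\s]', '', regex=True)
print(cleaned[0])  # → Beyond Meat PlantBased products spend 25
```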

Then I extracted keywords from the 'OFFER' column using the yake library:

import yake

offers = df2['OFFER'].unique()

language = 'en'
max_ngram_size = 3
deduplication_threshold = 0.9
numOfKeywords = 6
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)


tags = []
for offer in offers:
    keywords = custom_kw_extractor.extract_keywords(offer)
    subtags = []
    for kw in keywords:
        subtags.append(kw[0])
        
    tags.append(subtags)

This gave me a list of keyword lists, one per offer. I am pretty new to NLP, so this is where I am getting stuck. I can match the 'BRAND' in df1 to the 'BRAND' in df2 to filter out most of the rows in df1. Some offers are only available at specific retailers, and every offer belongs to a 'PRODUCT_CATEGORY'/'PARENT_CATEGORY'. My idea is to use the keywords extracted from the offers to find the most closely related row in df1. As the example data frames show, some offers could fit into multiple categories, like the 'BEYOND MEAT' categories. What is the best way to iterate through both data frames to find the appropriate 'PRODUCT_CATEGORY'/'PARENT_CATEGORY' for each offer? Or is there an easier (more effective/accurate) method?

rzan1

1 Answer


The following is a fairly simple approach, close to what you suggested, using the Jaccard similarity: the ratio of the size of the intersection between the offer keywords and the brand/category keywords to the size of their union. Each offer is simply associated with whichever brand/category row gives the highest Jaccard index (i.e. the columns ['PRODUCT_CATEGORY', 'PARENT_CATEGORY', 'CATEGORY_ID'] are added to df2).

import pandas as pd

#data = ...
#offer_data = ...

df1 = pd.DataFrame(data, columns=['BRAND', 'PRODUCT_CATEGORY', 'RECEIPTS', 'CATEGORY_ID', 'PARENT_CATEGORY'])
df2 = pd.DataFrame(offer_data, columns=['OFFER', 'RETAILER', 'BRAND'])

df1.fillna('', inplace=True)
df2.fillna('', inplace=True)

def keywords(x):
    # lowercase all strings in the row and split into a set of word tokens
    return set(' '.join(x).lower().split(' '))

def similarity(a, b):
    # Jaccard index: |intersection| / |union|
    if len(a) == 0 or len(b) == 0:
        return 0
    return len(a & b) / len(a | b)

df1_kws = df1[['BRAND', 'PRODUCT_CATEGORY', 'PARENT_CATEGORY']].apply(keywords, axis=1)
identifier = df1[['PRODUCT_CATEGORY', 'PARENT_CATEGORY', 'CATEGORY_ID']].values

res = []
for offer_kws in df2[['OFFER', 'RETAILER', 'BRAND']].apply(keywords, axis=1):
    best_sim = -1
    best = None
    for idx, kws in zip(identifier, df1_kws):
        sim = similarity(offer_kws, kws)
        if sim > best_sim:
            best = idx
            best_sim = sim
    res.append(best)

df2[['PRODUCT_CATEGORY', 'PARENT_CATEGORY', 'CATEGORY_ID']] = res

Note that this does not work well as-is and is mainly meant for demonstration purposes. One could attempt to improve the keyword extraction from the dataframe rows (see keywords), tweak the definition of the similarity measure (see similarity), or otherwise build custom logic into the selection process. However, an approach that may be very appropriate here is an encoder that maps a sentence (e.g. a sequence of keywords, or even just the concatenation of the strings in the relevant dataframe columns) to an embedding vector. Cosine similarity between such vectors often approximates semantic similarity nicely. I can only recommend having a look at this nice overview about sentence similarity.
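To make the cosine-similarity idea concrete without pulling in a full sentence encoder, here is a minimal sketch using plain bag-of-words count vectors (the token sets below are made-up illustrations loosely based on the example data; a real encoder, e.g. one from the sentence-transformers library, would replace the Counter step with an embedding model):

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # dot product over the shared tokens, normalised by both vector lengths
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# toy token bags loosely based on the example dataframes
offer    = Counter('beyond steak plant based seared tips at target'.split())
cat_meat = Counter('beyond meat plant based meat seafood'.split())
cat_wine = Counter('target wine alcohol'.split())

print(cosine(offer, cat_meat) > cosine(offer, cat_wine))  # → True
```

With embeddings, the same argmax-over-similarities loop as above applies; only the similarity changes from a Jaccard index over sets to a cosine over dense vectors.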

Michael Hodel