
I have two data frames (df1 and df2). df1 looks something like this:

data = [['RJ REYNOLDS VAPOR', 'Tobacco Products', 21, '4442fa51-d4b7-4d56-ad7c-d18cfd2aff84', 'Mature'],
['BEYOND MEAT', 'Packaged Meat', 30, 'e73f7957-0e65-4466-9588-795bdc5f67ac', 'Pantry'],
['BEYOND MEAT', 'Plant-Based Meat', 1584, 'd8cb60e5-b0c6-478a-971d-c6c55b17831f', 'Meat & Seafood'],
['BEYOND MEAT', 'Frozen Plant-Based Meat', 313, '8e0a9431-5462-4810-9f65-68fe36adf454', 'Frozen'],
['TARGET', 'Wine', 39, 'ca1c0f4d-3efc-4978-8357-69862996f416', 'Alcohol'],
['TARGET', 'Deodorant & Antiperspirant', 192, 'ca1c0f4d-3efc-4978-8357-69862996f416', 'Health & Wellness'],
['DOVE', 'Bath & Body', 15821, '47fae368-9b64-40be-82c9-898f953e9d66', 'Health & Wellness']]

df1 = pd.DataFrame(data, columns=['BRAND', 'PRODUCT_CATEGORY', 'RECEIPTS', 'CATEGORY_ID', 'PARENT_CATEGORY'])

Looks something like this (the original is 9906 x 5):

BRAND PRODUCT_CATEGORY RECEIPTS CATEGORY_ID PARENT_CATEGORY
RJ REYNOLDS VAPOR Tobacco Products 21 4442fa51-d4b7-4d56-ad7c-d18cfd2aff84 Mature
BEYOND MEAT Packaged Meat 30 e73f7957-0e65-4466-9588-795bdc5f67ac Pantry
BEYOND MEAT Plant-Based Meat 1584 d8cb60e5-b0c6-478a-971d-c6c55b17831f Meat & Seafood

df2 looks something like this:

offer_data = [['Beyond Meat® Plant-Based products, spend $25', None, 'BEYOND MEAT'],
['Beyond Steak™ Plant-Based seared tips, 10 ounce at Target', 'TARGET', 'BEYOND MEAT'],
['Beyond Steak™ Plant-Based seared tips, 10 ounce at H-E-B', 'H-E-B', 'BEYOND MEAT'],
['Dove Hand Wash select varieties buy 2 at TARGET', 'TARGET', 'DOVE'],
['Dove Hand Wash select varieties at Target', 'TARGET', 'DOVE']]

df2 = pd.DataFrame(offer_data, columns = ['OFFER', 'RETAILER', 'BRAND'])

The original df2 is 384 x 3:

OFFER RETAILER BRAND
Beyond Meat® Plant-Based products, spend $25 None BEYOND MEAT
Beyond Steak™ Plant-Based seared tips, 10 ounce at Target TARGET BEYOND MEAT

My goal is to add the appropriate 'PRODUCT_CATEGORY', 'PARENT_CATEGORY' and 'CATEGORY_ID' to each offer.

First, I removed all non-alphanumeric characters from the 'OFFER' column in df2:

df2['OFFER'] = df2['OFFER'].str.replace(r'[^\w\s]', '', regex=True)
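For reference, here is what that cleaning step does to one of the sample offers (a self-contained sketch; note that `[^\w\s]` also strips hyphens, so 'Plant-Based' becomes 'PlantBased', which may or may not be what you want for keyword matching):

```python
import pandas as pd

s = pd.Series(['Beyond Meat® Plant-Based products, spend $25'])
# raw string + regex=True: ®, the comma, $ and the hyphen are all removed
cleaned = s.str.replace(r'[^\w\s]', '', regex=True)
print(cleaned[0])  # → Beyond Meat PlantBased products spend 25
```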

Then I extracted keywords from the 'OFFER' column using the yake library:

import yake

offers = df2['OFFER'].unique()

language = 'en'
max_ngram_size = 3
deduplication_threshold = 0.9
numOfKeywords = 6
custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, top=numOfKeywords, features=None)


tags = []
for offer in offers:
    keywords = custom_kw_extractor.extract_keywords(offer)
    subtags = []
    for kw in keywords:
        subtags.append(kw[0])
        
    tags.append(subtags)

This gave me a list of keyword lists, one per offer. I am pretty new to NLP, so this is where I am getting stuck. I can match the 'BRAND' in df1 to the 'BRAND' in df2 to filter out most of the rows in df1. Some offers are only available at specific retailers, and every offer belongs to a 'PRODUCT_CATEGORY'/'PARENT_CATEGORY'. My idea is to use the keywords extracted from the offers to find the most closely related row in df1. As the example data frames show, some offers could fit into multiple categories, like the 'BEYOND MEAT' categories. What is the best way to iterate through both data frames to find the appropriate 'PRODUCT_CATEGORY'/'PARENT_CATEGORY' for each offer? Or is there an easier (more effective/accurate) method?

rzan1

1 Answer


The following is a fairly simple approach, close to what you suggested, using the Jaccard similarity: the ratio of the size of the intersection between the offer keywords and the brand/category keywords to the size of their union. Each offer is simply associated with whichever brand/category row gives the highest Jaccard index (i.e. the columns ['PRODUCT_CATEGORY', 'PARENT_CATEGORY', 'CATEGORY_ID'] are added to df2).

import pandas as pd

#data = ...
#offer_data = ...

df1 = pd.DataFrame(data, columns=['BRAND', 'PRODUCT_CATEGORY', 'RECEIPTS', 'CATEGORY_ID', 'PARENT_CATEGORY'])
df2 = pd.DataFrame(offer_data, columns=['OFFER', 'RETAILER', 'BRAND'])

df1.fillna('', inplace=True)
df2.fillna('', inplace=True)

def keywords(x):
    # lowercase all strings in the row and split into a set of word tokens
    return set(' '.join(x).lower().split(' '))

def similarity(a, b):
    # Jaccard index: |intersection| / |union|
    if len(a) == 0 or len(b) == 0:
        return 0
    return len(a & b) / len(a | b)

df1_kws = df1[['BRAND', 'PRODUCT_CATEGORY', 'PARENT_CATEGORY']].apply(keywords, axis=1)
identifier = df1[['PRODUCT_CATEGORY', 'PARENT_CATEGORY', 'CATEGORY_ID']].values

res = []
for offer_kws in df2[['OFFER', 'RETAILER', 'BRAND']].apply(keywords, axis=1):
    best_sim = -1
    best = None
    for idx, kws in zip(identifier, df1_kws):
        sim = similarity(offer_kws, kws)
        if sim > best_sim:
            best = idx
            best_sim = sim
    res.append(best)

df2[['PRODUCT_CATEGORY', 'PARENT_CATEGORY', 'CATEGORY_ID']] = res

Note that this does not work well as-is and is mainly meant for demonstration purposes. One could attempt to improve the keyword extraction from the dataframe rows (see keywords), tweak the definition of the similarity measure (see similarity), or otherwise build custom logic into the selection process. However, an approach that may be very appropriate here is an encoder that maps a sentence (e.g. a sequence of keywords, or even just the concatenation of the strings in the relevant dataframe columns) to an embedding vector. Cosine similarity between such vectors often approximates semantic similarity nicely. I can only recommend having a look at this nice overview about sentence similarity.
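To make the cosine-similarity idea concrete without pulling in a full sentence encoder, here is a minimal sketch using plain bag-of-words count vectors (the token sets below are made-up illustrations loosely based on the example data; a real encoder, e.g. one from the sentence-transformers library, would replace the Counter step with an embedding model):

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    # dot product over the shared tokens, normalised by both vector lengths
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# toy token bags loosely based on the example dataframes
offer    = Counter('beyond steak plant based seared tips at target'.split())
cat_meat = Counter('beyond meat plant based meat seafood'.split())
cat_wine = Counter('target wine alcohol'.split())

print(cosine(offer, cat_meat) > cosine(offer, cat_wine))  # → True
```

With embeddings, the same argmax-over-similarities loop as above applies; only the similarity changes from a Jaccard index over sets to a cosine over dense vectors.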

Michael Hodel