Matching a word in a Pandas column and creating a new column based on the match

Question

I have a pandas dataframe with multiple columns and a dict whose values are lists. One column of the df holds a description. I need to look at each description and check whether it contains one of the values in the dict's lists.

This is an extract from the dict:

clothing_types = {'T-Shirt': ['t-shirt', 'shirt', 'tee'],
          'Tank Top': ['tank top', 'mesh', 'top', 'tank'],
          'Socks': ['socks'],
          'Hat': ['cap'],
          'Trainers': ['trainers', 'snickers', 'shoes', 'furylite contemporary']}

This is the column:

0       UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP
1            UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS
2            UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS
3                     UNDER ARMOUR LADIES PLAY UP SHORTS
4             REEBOK LADIES CLASSIC LEATHER MID TRAINERS
5      UNDER ARMOUR MENS Spring Performance Oxford SHIRT
6       UNDER ARMOUR LADIES HEATGEAR ALPHA SHORTY SHORTS
7                                 ADIDAS LADIES PRO TANK
8                REEBOK LADIES ONE SERIES V NECK T-SHIRT
9                              REEBOK LADIES DF LONG BRA
10                     NIKE LADIES BASELINE TENNIS SKIRT
11              UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS
12      UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP

I can do the comparison through the normal for loops:

for item in self.original_file['Product Description'].tolist():
    found = False
    for item_type, type_descriptions in clothing_types.items():
        for description in type_descriptions:
            if description.upper() in item.upper():
                # print(item_type, item)
                found = True
                break

    if not found:
        print('NOT FOUND', item)

I have also tried to do it with np.where:

for item_type, type_descriptions in clothing_types.items():
    for description in type_descriptions:
        self.original_file['Category'] = np.where(description.upper() in self.original_file['Product Description'], item_type, 'None')

However, each iteration overwrites the column with the result of the latest comparison, so every value ends up as 'None'.

The expectation is that if, say, "SHIRT" is in the description, then "T-Shirt" (the corresponding key of the dict) is populated in the new column, Category.
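
A minimal check of why the np.where attempt ends up with 'None' everywhere, using a hypothetical one-row Series (the names s, label_hit and so on are mine, not from the question). The `in` operator on a Series tests the index labels rather than the values, so the condition is a single False, and np.where then returns one scalar that gets broadcast over the whole column:

```python
import numpy as np
import pandas as pd

# Hypothetical one-row stand-in for the 'Product Description' column.
s = pd.Series(['UNDER ARMOUR MENS Spring Performance Oxford SHIRT'])

# `in` on a Series looks at the index labels (here just 0), not the strings.
label_hit = 'SHIRT' in s
index_hit = 0 in s

# With a scalar condition, np.where returns a single 0-d value ...
scalar_result = np.where(label_hit, 'T-Shirt', 'None')

# ... whereas an element-wise condition gives one result per row.
rowwise_result = np.where(s.str.contains('SHIRT', case=False), 'T-Shirt', 'None')
```

Even with an element-wise condition, assigning to the column inside the loop would still overwrite earlier matches, so the per-key results have to be combined in some way.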

What do you do if there is a match? What is the expected output? A column of True/False? — cs95, Jul 21 '19 at 18:04
I need the key of the dictionary populated, so if there is a match for shirt -> T-Shirt will be populated — zendek, Jul 21 '19 at 18:05

Erfan · Answer 1 · 2019-07-21T19:03:11.533

We can check with str.contains whether a description matches any of the keywords for a key. If we get a hit, we fill in the key of the dictionary, otherwise an empty string. Finally we concatenate the per-key results row by row and join them to the dataframe as a column:

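import numpy as np
import pandas as pd

# One array per key: the key where any of its keywords matches (case-insensitive), else ''
matches = [np.where(df['Product Description'].str.contains('|'.join(v), case=False), 
                    k, 
                    '') for k, v in clothing_types.items()]

# Concatenate the per-key strings row-wise into a single 'Matches' column
matches_df = pd.DataFrame(matches).T.sum(axis=1).to_frame('Matches')

df = df.join(matches_df)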

Output

                                  Product Description   Matches
0    UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP  Tank Top
1         UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS     Socks
2         UNDER ARMOUR LADIES SPEEDFORM NO SHOW SOCKS     Socks
3                  UNDER ARMOUR LADIES PLAY UP SHORTS          
4          REEBOK LADIES CLASSIC LEATHER MID TRAINERS  Trainers
5   UNDER ARMOUR MENS Spring Performance Oxford SHIRT   T-Shirt
6    UNDER ARMOUR LADIES HEATGEAR ALPHA SHORTY SHORTS          
7                              ADIDAS LADIES PRO TANK  Tank Top
8             REEBOK LADIES ONE SERIES V NECK T-SHIRT   T-Shirt
9                           REEBOK LADIES DF LONG BRA          
10                  NIKE LADIES BASELINE TENNIS SKIRT          
11           UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS       Hat
12   UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP  Tank Top
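
Row 11 is tagged Hat because the keyword 'cap' occurs inside 'ESCAPE': str.contains matches substrings. If whole-word matching is wanted, one possible variation (a sketch, not part of the answer above) is to wrap the keywords in `\b` word boundaries and escape them with re.escape. The third sample row, the Category column name and the "first matching key wins" rule are my own assumptions for illustration:

```python
import re
import pandas as pd

clothing_types = {
    'T-Shirt': ['t-shirt', 'shirt', 'tee'],
    'Tank Top': ['tank top', 'mesh', 'top', 'tank'],
    'Socks': ['socks'],
    'Hat': ['cap'],
}

df = pd.DataFrame({'Product Description': [
    'UNDER ARMOUR MENS ESCAPE 7" SOLID SHORTS',
    'ADIDAS LADIES PRO TANK',
    'NIKE MENS RUNNING CAP',  # hypothetical row, not in the question's data
]})

df['Category'] = ''
for category, words in clothing_types.items():
    # \b...\b only matches whole words, so 'cap' no longer hits 'ESCAPE'.
    pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
    hit = df['Product Description'].str.contains(pattern, case=False, regex=True)
    # Assumption: keep the first category that matched, do not overwrite it.
    df.loc[hit & (df['Category'] == ''), 'Category'] = category
```

With these three rows, the SHORTS description stays empty, the TANK row becomes Tank Top and the CAP row becomes Hat.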

score 0 · Answer 2 · answered Jul 21 '19 at 18:34

So this works, but I am not sure if it is the best solution:

for i in self.original_file.index:
    for item_type, type_descriptions in clothing_types.items():
        for description in type_descriptions:
            if description.upper() in self.original_file.at[i, 'Product Description'].upper():
                self.original_file.at[i, 'Category'] = item_type

score 0 · Answer 3 · answered Jul 21 '19 at 18:41

First, you should invert your clothing_types dict so that each keyword maps to its key, like this:

import itertools

clothing_types2 = dict(list(itertools.chain(*[[(y_, x) for y_ in y] for x, y in clothing_types.items()])))

(reference)

Then, create a function that checks, per row, whether any word of the description is in the new dict you created:

def to_category(x):
    for w in x.lower().split(" "):
        if w in clothing_types2:
            return clothing_types2[w]
    return None

Finally, apply the method on the column and save the result to a new one:

df["Category"] = df["Product Description"].apply(to_category)
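
The three steps above can be put together in a small end-to-end sketch. The sample rows are taken from the question; the name keyword_to_category and the dict comprehension (used here in place of the itertools expression) are my own choices:

```python
import pandas as pd

clothing_types = {
    'T-Shirt': ['t-shirt', 'shirt', 'tee'],
    'Tank Top': ['tank top', 'mesh', 'top', 'tank'],
    'Socks': ['socks'],
}

# Inverted lookup: keyword -> category.
keyword_to_category = {
    word: category
    for category, words in clothing_types.items()
    for word in words
}

def to_category(description):
    # Return the category of the first space-separated word found in the lookup.
    for word in description.lower().split(' '):
        if word in keyword_to_category:
            return keyword_to_category[word]
    return None

df = pd.DataFrame({'Product Description': [
    'UNDER ARMOUR LADIES FLY-BY STRETCH MESH TANK TOP',
    'UNDER ARMOUR LADIES PLAY UP SHORTS',
    'REEBOK LADIES ONE SERIES V NECK T-SHIRT',
]})
df['Category'] = df['Product Description'].apply(to_category)
categories = df['Category'].tolist()
```

Because the description is split on single spaces, multi-word keywords such as 'tank top' can never equal one token. The first row is still categorised here only because 'mesh' is also a keyword.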
