2

I have a dataframe like so:

df = pd.DataFrame({'item_descrip': ['ebc root beer single', 
                                    'yic yac big pack freshmint', 
                                    'froggy jumbo flakes',
                                    'jumbo tart warmer',
                                    'beer jerky'
                                   ]
})

I have a list like so:

brand_list = ['ebc', 'yic yac', 'beer', 'jumbo', 'tart', 'froggy']

I want to match strings in the brand_list to the strings in the item_descrip column and remove the matches in the item_descrip column. I want to create another column unbranded that contains the cleaned strings from item_descrip.

My problem is that I have a very large brand_list and some of the strings from this list are matching multiple times in the item_descrip column. My desired output is if a match is already found for one row, then skip that row.

Desired output:

|    | item_descrip                       | unbranded                          |
|---:|:-----------------------------------|:-----------------------------------|
|  0 | ebc root beer single               | root beer single                   |
|  1 | yic yac big pack freshmint singles | big pack freshmint singles         |
|  2 | froggy jumbo flakes                | jumbo flakes                       |
|  3 | jumbo tart warmer                  | tart warmer                        |
|  4 | beer jerky                         | jerky                              |

This is the code that works to remove matches, but it removes all matches in the item_descrip column. For example, in my brand_list I have ebc and beer in the list. For the first record, I only want ebc to be removed and not beer since a match was already made. If a match is made on the first part of the string, then don't process that record any further and go onto the next.

So basically, it seems like an if statement could go into the list comprehension, but I'm not sure how to write something that says: if matched pass, else keep searching.

df['unbranded'] = [' '.join([y for y in x.split() if not y.startswith(tuple(brand_list))]) for x in df['item_descrip']] 

I got the most of this one-liner here: https://stackoverflow.com/questions/51666374/how-to-remove-strings-present-in-a-list-from-a-column-in-pandas

Pysnek313
  • 134
  • 14

1 Answers1

1

DISCLAIMER: I'm the author of trrex

If you care about performance use trrex:

import pandas as pd
import trrex as tx

df = pd.DataFrame({'item_descrip': ['ebc root beer single',
                                    'yic yac big pack freshmint',
                                    'froggy jumbo flakes',
                                    'jumbo tart warmer',
                                    'beer jerky'
                                    ]
                   })

brand_list = ['ebc', 'yic yac', 'beer', 'jumbo', 'tart', 'froggy']

df['unbranded'] = df['item_descrip'].str.replace(tx.make(brand_list), '', n=1)
print(df)

Output

                 item_descrip            unbranded
0        ebc root beer single     root beer single
1  yic yac big pack freshmint   big pack freshmint
2         froggy jumbo flakes         jumbo flakes
3           jumbo tart warmer          tart warmer
4                  beer jerky                jerky

The function make will build a trie regex. To better understand what is happening, make generates the following regular expression:

\b(?:froggy|tart|beer|yic yac|jumbo|ebc)\b

The argument n=1, means that the pattern will be replaced only one time, from the documentation:

n int, default -1 (all)

Number of replacements to make from start.
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76
  • Thanks @DaniMesejo for your response. I will try this out on Monday when I go back into work and let you know how it goes. – Pysnek313 Dec 04 '20 at 23:43
  • I tried your one liner on my code and got this error `error: nothing to repeat at position 421`. I read online that this is a known python bug? Do you know how I can escape this? – Pysnek313 Dec 07 '20 at 14:07
  • @Pysnek313 do the words have any punctuation or something of that sort? – Dani Mesejo Dec 07 '20 at 14:09
  • definitely has punctuation. Should I run all my text through my pre-tokenizer cleaner? My only concern is that I have brand names that include punctuation in their name, for example: R-D-C, Co. – Pysnek313 Dec 07 '20 at 14:12
  • Try removing the punctuation yes – Dani Mesejo Dec 07 '20 at 14:13
  • I used the Gensim package to strip: numeric, punctuation, non_alphanum, and whitespaces on both my list of brands and the df column. I still am getting the `error: nothing to repeat at position X`. When I look at the record in position X in my df, I dont see what could be wrong.... it looks like a normal item description. Im not sure if its telling me there is not a match and nothing to remove or what because the text at position X looks normal. Is there a way to handle this error? – Pysnek313 Dec 07 '20 at 15:18
  • Sorry but I cannot be of any aid, it seems that there some plus sign or *, but without any further details I cannot tell – Dani Mesejo Dec 07 '20 at 15:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225631/discussion-between-dani-mesejo-and-pysnek313). – Dani Mesejo Dec 07 '20 at 15:24
  • I'm in the chat – Pysnek313 Dec 07 '20 at 16:39