I have a dataframe like so:
import pandas as pd
df = pd.DataFrame({'item_descrip': ['ebc root beer single',
                                    'yic yac big pack freshmint',
                                    'froggy jumbo flakes',
                                    'jumbo tart warmer',
                                    'beer jerky']})
I have a list like so:
brand_list = ['ebc', 'yic yac', 'beer', 'jumbo', 'tart', 'froggy']
I want to match the strings in `brand_list` against the strings in the `item_descrip` column and remove the matches. The cleaned strings should go into a new column, `unbranded`.
My problem is that `brand_list` is very large, and some of its strings match more than once within a single `item_descrip` value. What I want is: once a match has been found and removed for a row, stop processing that row and move on to the next one.
Desired output:
| | item_descrip | unbranded |
|---:|:-----------------------------------|:-----------------------------------|
| 0 | ebc root beer single | root beer single |
| 1 | yic yac big pack freshmint | big pack freshmint |
| 2 | froggy jumbo flakes | jumbo flakes |
| 3 | jumbo tart warmer | tart warmer |
| 4 | beer jerky | jerky |
The code below removes matches, but it removes every match in the `item_descrip` column. For example, `brand_list` contains both `ebc` and `beer`. For the first record, I only want `ebc` to be removed and not `beer`, since a match was already made. If a match is made on the first part of the string, then that record should not be processed any further, and the search should go on to the next record.
So basically, it seems like an if statement could go into the list comprehension, but I'm not sure how to write something that says: if matched pass, else keep searching.
df['unbranded'] = [' '.join([y for y in x.split() if not y.startswith(tuple(brand_list))]) for x in df['item_descrip']]
I got most of this one-liner from here:
https://stackoverflow.com/questions/51666374/how-to-remove-strings-present-in-a-list-from-a-column-in-pandas
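To make the "stop after the first match" behaviour concrete, here is a minimal sketch of one possible approach, not a definitive solution. It swaps the list comprehension for a single compiled regular expression and removes only the leftmost match per row. The names `pattern` and `strip_first_brand` are hypothetical, and two things are assumptions on my part: brands are matched as whole words (unlike `startswith`, which also matches prefixes), and longer brands are tried before shorter ones.

```python
import re
import pandas as pd

df = pd.DataFrame({'item_descrip': ['ebc root beer single',
                                    'yic yac big pack freshmint',
                                    'froggy jumbo flakes',
                                    'jumbo tart warmer',
                                    'beer jerky']})
brand_list = ['ebc', 'yic yac', 'beer', 'jumbo', 'tart', 'froggy']

# Build one alternation from all brands. Sorting longest-first is an
# assumption: it lets a longer brand win over a shorter one that starts
# at the same position. re.escape guards against regex metacharacters.
alternatives = '|'.join(
    re.escape(brand) for brand in sorted(brand_list, key=len, reverse=True)
)

# \b ... \b matches whole words only (an assumption); the trailing \s*
# also consumes the whitespace that follows the removed brand.
pattern = re.compile(r'\b(?:' + alternatives + r')\b\s*')


def strip_first_brand(text):
    """Remove only the leftmost brand match, then stop (hypothetical helper)."""
    # count=1 limits the substitution to the first match in the string.
    return pattern.sub('', text, count=1).strip()


df['unbranded'] = df['item_descrip'].map(strip_first_brand)
print(df)
```

With the sample data, `beer` stays in the first row because `ebc` is matched first and `count=1` ends the substitution there. Compiling the pattern once should also help with a very large `brand_list`, although I have not measured it.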