
I have a problem similar to this question.

I am importing a large .csv file into pandas for a project. One column in the dataframe ultimately contains four fields of concatenated data (I can't control the data I receive): a brand name (which I want to remove), a product description, a product size, and a UPC. Please note that the brand text inside Item_UPC does not always match Brand.

For example:

import pandas as pd
df = pd.DataFrame({'Item_UPC': ['fubar baz dr frm prob onc dly wmn ogc 30vcp 06580-66-832',
 'xxx stuff coll tides 20 oz 09980-66-832',
 'hel world sambucus elder 60 chw 0392-67-491',
 'northern cold ultimate  180 sg 06580-66-832',
 'ancient nuts  boogs 16oz 58532-42-123 '], 
                   'Brand': ['FUBAR OF BAZ',
 'XXX STUFF',
 'HELLO WORLD',
 'NORTHERN COLDNITES',
 'ANCIENT NUTS']}) 

I want to remove the brand name from the Item_UPC column, as this is redundant information among other issues. Currently I have a function that takes the new df, pulls out the UPC, and cleans it up to match what one finds on bottles and in another database I have for a single brand, minus the last checksum digit.

def clean_upc(df):
    #take in a dataframe and expand the combined column into a temp
    #dataframe; strip first, since a trailing space would otherwise
    #leave an empty UPC token after the rsplit
    temp = df["Item_UPC"].str.strip().str.rsplit(" ", n=1, expand=True)
    #add columns to the main dataframe from temp
    df.insert(0, "UPC", temp[1])
    df.insert(1, "Item", temp[0])

    #drop the original combined column
    df.drop(columns=["Item_UPC"], inplace=True)

    #remove the leading zero and the hyphens in UPC
    df["UPC"] = df["UPC"].apply(lambda x: x[1:] if x.startswith("0") else x)
    df["UPC"] = df["UPC"].str.replace('-', '', regex=False)
    #make all string values lower case to ease searching
    for col in df.columns:
        df[col] = df[col].apply(lambda x: x.lower() if isinstance(x, str) else x)

After running this I have a dataframe with three columns: UPC, Item, and Brand.
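The split-and-clean steps can be checked in isolation; a minimal sketch using two of the sample rows (note the trailing space on the last Item_UPC value, which would otherwise leave an empty UPC token after the rsplit):

```python
import pandas as pd

s = pd.Series(['northern cold ultimate  180 sg 06580-66-832',
               'ancient nuts  boogs 16oz 58532-42-123 '])  # note trailing space

# strip first: an rsplit on a string ending in a space yields '' as the last token
temp = s.str.strip().str.rsplit(" ", n=1, expand=True)
upc = (temp[1]
       .apply(lambda x: x[1:] if x.startswith("0") else x)  # drop leading zero
       .str.replace('-', '', regex=False))                  # drop hyphens
print(upc.tolist())      # ['658066832', '5853242123']
print(temp[0].tolist())  # ['northern cold ultimate  180 sg', 'ancient nuts  boogs 16oz']
```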

The dataframe has over 300k rows and 2,300 unique brands, and there is no consistent manner in which names are shortened. When I run the following code:

temp = df["Item"].str.rsplit(" ", expand = True)

temp has a shape of

temp.shape 
 (329868, 13)

which makes manual curation a pain, since the trailing columns are mostly empty. Currently my logic is to first split Brand into two columns while dropping the first column of temp:

brand = df["Brand"].str.rsplit(" ", n=1, expand=True) #produce a dataframe of two columns
temp.drop(columns=[0], inplace=True)

and then do a regex string replace on temp[1] to see whether it contains brand[1] (or vice versa), replace any match with " ", and then concatenate temp back together:

temp["combined"] = temp[1] + temp[2] + ... + temp[12]

and replace the existing Item column with the combined column

df["Item"] = temp["combined"]
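The recombination described above can be sketched as follows (temp here is a small hypothetical stand-in for the real split result; the NaN padding from the ragged split has to be filled before joining):

```python
import pandas as pd

# hypothetical stand-in for df["Item"].str.rsplit(" ", expand=True)
temp = pd.DataFrame({
    0: ['northern', 'ancient'],
    1: ['cold', 'nuts'],
    2: ['ultimate', 'boogs'],
    3: ['180', '16oz'],
    4: ['sg', None],   # ragged rows pad with NaN
})

temp.drop(columns=[0], inplace=True)
# join the remaining tokens back together, skipping the NaN padding
temp["combined"] = temp.fillna('').agg(' '.join, axis=1).str.strip()
print(temp["combined"].tolist())  # ['cold ultimate 180 sg', 'nuts boogs 16oz']
```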

Or is there a better way all around? There are many brands that have only a one-word name, which may make everything faster. I have been struggling with regex; logically it seems like this would be faster, but I have a hard time coming up with the syntax to make it work.
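For what it's worth, one alternative to per-column regex replacement is to work token-wise: drop leading tokens of Item that look like (possibly abbreviated) brand tokens. This is only a rough sketch under the assumption that abbreviations are prefixes ('cold' for 'COLDNITES'); very short tokens could over-match, so the heuristic would need tuning against real data:

```python
import pandas as pd

def strip_brand(item, brand):
    """Drop leading tokens of item that look like (abbreviated) brand tokens."""
    brand_tokens = [t.lower() for t in brand.split()]
    tokens = item.split()

    def is_brand_token(tok):
        # exact match, or one string is a prefix of the other (abbreviation)
        return any(b.startswith(tok) or tok.startswith(b) for b in brand_tokens)

    while tokens and is_brand_token(tokens[0]):
        tokens.pop(0)
    return ' '.join(tokens)

df = pd.DataFrame({
    'Item': ['northern cold ultimate  180 sg', 'hel world sambucus elder 60 chw'],
    'Brand': ['NORTHERN COLDNITES', 'HELLO WORLD'],
})
df['Item'] = df.apply(lambda r: strip_brand(r['Item'], r['Brand']), axis=1)
print(df['Item'].tolist())  # ['ultimate 180 sg', 'sambucus elder 60 chw']
```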

  • can you clarify your exact question? Is there a solution you're looking for, or behavior you wish to optimize? – anon01 Aug 26 '20 at 00:44
  • As an example, I want to change the last row 'ancient nuts boogs 16oz 58532-42-123 ' to 'boogs 16oz'. My function for removing and cleaning the UPC works fine, but I can't figure out how to remove 'ancient nuts' from that row, or any other brand name. – ClaytonSummitt Aug 26 '20 at 00:50

1 Answer


Because the input does not follow any well-defined rules, this looks like more of an optimization problem. You can start by stripping exact matches:

df["Item_cleaned"] = df.apply(lambda x: x.Item_UPC.lstrip(x.Brand.lower()), axis=1)

output:

                                            Item_UPC               Brand                                    Item_cleaned
0  fubar baz dr frm prob onc dly wmn ogc 30vcp 06...        FUBAR OF BAZ  dr frm prob onc dly wmn ogc 30vcp 06580-66-832
1            xxx stuff coll tides 20 oz 09980-66-832           XXX STUFF                   coll tides 20 oz 09980-66-832
2        hel world sambucus elder 60 chw 0392-67-491         HELLO WORLD               sambucus elder 60 chw 0392-67-491
3        northern cold ultimate  180 sg 06580-66-832  NORTHERN COLDNITES                   ultimate  180 sg 06580-66-832
4             ancient nuts  boogs 16oz 58532-42-123         ANCIENT NUTS                        boogs 16oz 58532-42-123

This method will strip any exact matches and output them to a new column, Item_cleaned. If your input is abbreviated, you will need a more complex fuzzy string matching algorithm. This may be prohibitively slow, however. In that case, I would recommend a two-step method: save all rows that have been cleaned by the approach above, and do a second pass with more complicated cleaning as needed.
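One caveat with the apply above: str.lstrip treats its argument as a set of characters, not a prefix string, so it can eat into the word after the brand (the 'ultimate' → 'mate' behavior reported in the comments). On Python 3.9+, str.removeprefix removes the exact string instead; a sketch (the Brand value here is hypothetical, chosen so the prefix matches exactly; for abbreviated brands removeprefix simply removes nothing, which is safer than over-stripping):

```python
import pandas as pd

df = pd.DataFrame({
    'Item_UPC': ['northern cold ultimate 180 sg 06580-66-832'],
    'Brand': ['NORTHERN COLD'],  # hypothetical exact-prefix brand
})

# removeprefix (Python 3.9+) removes the exact string; the trailing
# lstrip() with no argument only trims leftover whitespace
df["Item_cleaned"] = df.apply(
    lambda x: x.Item_UPC.removeprefix(x.Brand.lower()).lstrip(), axis=1)
print(df["Item_cleaned"].iloc[0])  # 'ultimate 180 sg 06580-66-832'
```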

anon01
  • There is a reason this was the last thing on my to do list. Many, though not all, of the brands in ```Item_UPC``` are identical to ```Brand```. Thank you for the general approach and answer. Do you prefer x.Brand.lower() over x["Brand"].lower() ? None of the column names are to my knowledge pandas methods so there should be no collisions. – ClaytonSummitt Aug 26 '20 at 01:23
  • So this works great on my synthetic data set, where I changed the names of the brands, but on my real data set it doesn't behave quite the same. For example in row index 3, after the .lstrip(x.Brand.lower()), it is stripped from ```northern cold ultimate 180 sg 06580-66-832``` to ```mate 180 sg```, when it should be ```ultimate 180 sg```. That is after I run my clean_upc() function. But for ```ancient nuts``` it works perfectly. So for perfect matches it is perfect; otherwise it has issues I need to work on. (thinking out loud here) – ClaytonSummitt Aug 26 '20 at 01:50