I have a problem similar to this question.
I am importing a large .csv file into pandas for a project. One column in the dataframe ultimately contains 4 columns of concatenated data (I can't control the data I receive): a brand name (what I want to remove), a product description, a product size, and a UPC. Please note that the brand description in Item_UPC does not always == Brand.
For example:
import pandas as pd

df = pd.DataFrame({'Item_UPC': ['fubar baz dr frm prob onc dly wmn ogc 30vcp 06580-66-832',
                                'xxx stuff coll tides 20 oz 09980-66-832',
                                'hel world sambucus elder 60 chw 0392-67-491',
                                'northern cold ultimate 180 sg 06580-66-832',
                                'ancient nuts boogs 16oz 58532-42-123 '],
                   'Brand': ['FUBAR OF BAZ',
                             'XXX STUFF',
                             'HELLO WORLD',
                             'NORTHERN COLDNITES',
                             'ANCIENT NUTS']})
I want to remove the brand name from the Item_UPC column, as this is redundant information among other issues. Currently I have a function that takes the new df, pulls out the UPC, and cleans it up to match what one finds on bottles and in another database I have for a single brand, minus the last check-sum digit.
def clean_upc(df):
    # take in a dataframe, expand the number of columns into a temp dataframe
    # (strip first so a trailing space can't leave an empty UPC)
    temp = df["Item_UPC"].str.strip().str.rsplit(" ", n=1, expand=True)
    # add columns to main dataframe from temp
    df.insert(0, "UPC", temp[1])
    df.insert(1, "Item", temp[0])
    # drop original combined column
    df.drop(columns=["Item_UPC"], inplace=True)
    # remove leading zero on, and hyphens in, the UPC
    df["UPC"] = df["UPC"].apply(lambda x: x[1:] if x.startswith("0") else x)
    df["UPC"] = df["UPC"].str.replace("-", "", regex=False)
    # make all string values lower case to ease searching
    for col in df.columns:
        df[col] = df[col].apply(lambda x: x.lower() if isinstance(x, str) else x)
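As an aside, the last sample row ('ancient nuts boogs 16oz 58532-42-123 ') ends with a space; without stripping the column first, rsplit hands back an empty string where the UPC should be. A quick check:

```python
import pandas as pd

s = pd.Series(["ancient nuts boogs 16oz 58532-42-123 "])  # note the trailing space

raw = s.str.rsplit(" ", n=1, expand=True)                 # split on the raw value
clean = s.str.strip().str.rsplit(" ", n=1, expand=True)   # strip, then split

print(repr(raw[1].iloc[0]))   # '' -- the "UPC" slot is empty
print(clean[1].iloc[0])       # 58532-42-123
```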
After running this I have a dataframe with three columns:

UPC, Item, Brand

The dataframe has over 300k rows and 2,300 unique brands in it, and there is no consistent manner in which the brands shorten their names. When I run the following code
temp = df["Item"].str.rsplit(" ", expand = True)
temp has a shape of
temp.shape
(329868, 13)
which makes manual curating a pain, as most of the trailing columns are empty. Currently my logic is to first split Brand into two columns while dropping the first column in temp:
brand = df["Brand"].str.rsplit(" ", n=1, expand=True)  # produces a dataframe of two columns
temp.drop(columns=[0], inplace=True)
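A side note on that split: for the many one-word brands, rsplit with n=1 leaves the second column as None rather than a string, so anything driven by brand[1] would need a guard. A small check:

```python
import pandas as pd

# two-word brands split cleanly; one-word brands get padded with None
brand = pd.Series(["FUBAR OF BAZ", "ANCIENT"]).str.rsplit(" ", n=1, expand=True)
print(brand[1].tolist())  # second entry is None, not a string
```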
and then do a string replace on temp[1] to see if it matches the pattern in brand[1] and, if so, replace it with " " (or vice versa), and then concatenate temp back together:

temp["combined"] = temp[1] + temp[2] + ... + temp[12]

and replace the existing Item column with the combined column:

df["Item"] = temp["combined"]
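One thing I have noticed with that sum of columns: expand=True pads short rows with None, and adding a None/NaN into a string concatenation turns the whole row into NaN, so the trailing columns would likely need fillna("") first. A minimal illustration:

```python
import pandas as pd

# one row has two tokens, the other only one -- column 1 is padded with None
temp = pd.Series(["a b", "c"]).str.split(" ", expand=True)

print((temp[0] + temp[1]).tolist())              # ['ab', nan] -- NaN poisons the row
print((temp[0] + temp[1].fillna("")).tolist())   # ['ab', 'c']
```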
Or is there a better way all around? There are many brands that only have one name, which may make everything faster. I have been struggling with regex; logically it seems like this approach would be faster, I just have a hard time coming up with the syntax to make it work.
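To make the token idea concrete, here is a rough sketch I have been toying with, which avoids regex entirely: drop leading Item tokens that look like prefixes of the brand's words. strip_brand_prefix is just an illustrative name, and the prefix assumption (e.g. "hel world" abbreviating "HELLO WORLD") is untested across all 2,300 brands:

```python
import pandas as pd

def strip_brand_prefix(item: str, brand: str) -> str:
    """Drop leading Item tokens that look like (possibly abbreviated) brand words."""
    brand_words = brand.lower().split()
    item_words = item.split()
    i = 0
    # assume each leading brand token in Item is a prefix of some brand word;
    # stop at the first token that matches no brand word
    while i < len(item_words) and any(w.startswith(item_words[i]) for w in brand_words):
        i += 1
    return " ".join(item_words[i:])

df = pd.DataFrame({"Item": ["hel world sambucus elder 60 chw",
                            "northern cold ultimate 180 sg"],
                   "Brand": ["HELLO WORLD", "NORTHERN COLDNITES"]})
df["Item"] = [strip_brand_prefix(i, b) for i, b in zip(df["Item"], df["Brand"])]
print(df["Item"].tolist())  # ['sambucus elder 60 chw', 'ultimate 180 sg']
```

The obvious risk is over-stripping when a genuine product word happens to be a prefix of a brand word, so I would want to spot-check it before trusting it on the full 300k rows.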