I have an identifier function that goes through all the elements in a DataFrame column and assigns each one a category. The code as I have it now looks like this:
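    import re

    def fruit_replace(x):
        # The quantity check is the odd one out: it needs a regex, not a substring test.
        fruit_quantity = re.search(r'(\d+)quantity', x)
        if 'apple' in x:
            return 'green'
        elif 'pear' in x:
            return 'green'
        elif 'cherry' in x:
            return 'red'
        elif 'banana' in x:
            return 'yellow'
        elif fruit_quantity is not None:
            # group(1) is the captured number; group(0) would be e.g. '10quantity'
            return int(fruit_quantity.group(1))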
I apply this with a lambda function on the DataFrame column and assign the results to a new column. Unfortunately, it is a bit complicated because the fruit_quantity check is a regex search, unlike the other checks, which are plain substring tests.
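For reference, the call looks roughly like the sketch below. It is a minimal, self-contained version: the column names `fruit_type` and `category` are taken from the example data in this question, and returning the quantity as an integer is an assumption based on the expected output.

```python
import re

import pandas as pd


def fruit_replace(x):
    # Same categorisation rules as the function in the question; the quantity
    # is returned as an int (assumption based on the expected output).
    if 'apple' in x or 'pear' in x:
        return 'green'
    if 'cherry' in x:
        return 'red'
    if 'banana' in x:
        return 'yellow'
    fruit_quantity = re.search(r'(\d+)quantity', x)
    if fruit_quantity is not None:
        return int(fruit_quantity.group(1))
    return None


df = pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'peach 10quantity']})

# Row-by-row application: one Python-level function call per element,
# which is the part that gets slow on millions of rows.
df['category'] = df['fruit_type'].apply(lambda x: fruit_replace(x))
```

The lambda is not strictly needed here; passing `fruit_replace` directly to `apply` should behave the same way.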
The process should yield something like this:
Original DataFrame:
    pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'peach 10quantity']})
Into this:
    pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'peach 10quantity'],
                  'category': ['green', 'red', 10]})
My question is whether this code can be made more Pythonic or more pandas-idiomatic, and possibly vectorized. I have to apply it to about 5 million rows, and this takes some time.
Many thanks!