I have an identifier function that goes through all the elements in a DataFrame column and assigns each one a category. The code as I have it now looks like this:

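import re

def fruit_replace(x):
    # Digits immediately before the word "quantity", e.g. "10quantity".
    # A raw string needs a single backslash: r'\d', not r'\\d'.
    fruit_quantity = re.search(r'(\d+)quantity', x)
    if 'apple' in x:
        return 'green'
    elif 'pear' in x:
        return 'green'
    elif 'cherry' in x:
        return 'red'
    elif 'banana' in x:
        return 'yellow'
    elif fruit_quantity is not None:
        # group(1) is the captured number only; group(0) would be '10quantity'
        return fruit_quantity.group(1)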

I apply this with a lambda function on the DataFrame column and assign the results to a new column. Unfortunately, it is a bit complicated because the fruit_quantity search works differently from the other checks.
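
For reference, the apply step described above might look like the following minimal sketch. The column names `fruit_type` and `category` are taken from the example DataFrames in this question; the pattern is written with a single `\d` and returns `group(1)`, which is an assumption about the intended behaviour (digits only for the quantity rows).

```python
import re

import pandas as pd


def fruit_replace(x):
    # Substring checks run first, in the same order as in the question.
    if 'apple' in x or 'pear' in x:
        return 'green'
    if 'cherry' in x:
        return 'red'
    if 'banana' in x:
        return 'yellow'
    # Otherwise fall back to the digits in front of "quantity", if present.
    fruit_quantity = re.search(r'(\d+)quantity', x)
    if fruit_quantity is not None:
        return fruit_quantity.group(1)
    return None  # no rule matched


df = pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'jerry 10quantity']})

# Row-by-row application: one Python function call per element of the column.
df['category'] = df['fruit_type'].apply(lambda x: fruit_replace(x))
```

Because the function is called once per row in Python, the run time grows with the number of rows, which is why this gets slow at several million rows.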

The process should yield something like this:

Original DataFrame

pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'jerry 10quantity']})

Into this

pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'jerry 10quantity'],
              'category': ['green', 'red', 10]})

Can this code be made more Pythonic or more idiomatic for pandas, and possibly vectorized? I have to apply it to about 5 million rows, and that takes some time.

Many thanks!

jim mako
  • Please provide a sample data set (5-7 rows) and desired data set. Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – MaxU - stand with Ukraine Feb 19 '17 at 20:03

1 Answer

You can use boolean indexing in conjunction with the str.contains() method:

df['category'] = None  # object column, so string labels can be assigned below

df.loc[df.fruit_type.str.contains(r'\b(?:apple|pear)\b'), 'category'] = 'green'
df.loc[df.fruit_type.str.contains(r'\b(?:cherry)\b'), 'category'] = 'red'
df.loc[df.fruit_type.str.contains(r'\b(?:banana)\b'), 'category'] = 'yellow'
df.loc[df['category'].isnull() & (df.fruit_type.str.contains(r'\d+q')), 'category'] = \
    df.fruit_type.str.extract(r'(\d+)q', expand=False)

Result:

In [270]: df
Out[270]:
         fruit_type category
0         big apple    green
1      small cherry      red
2  jerry 10quantity       10
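
As an alternative sketch (not part of the code above), the same masks can be passed to `numpy.select`, which picks the first matching condition for each row, much like the if/elif chain in the question. The example DataFrame and the `(\d+)quantity` pattern are taken from the question; the variable names `s`, `quantity`, `conditions` and `choices` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'jerry 10quantity']})

s = df['fruit_type']

# Digits in front of "quantity"; rows without a match become missing values.
quantity = s.str.extract(r'(\d+)quantity', expand=False)

# One boolean mask per rule, listed in priority order.
conditions = [
    s.str.contains(r'\b(?:apple|pear)\b'),
    s.str.contains(r'\bcherry\b'),
    s.str.contains(r'\bbanana\b'),
    quantity.notna(),
]
# The value to use where the corresponding mask is True.
choices = ['green', 'red', 'yellow', quantity]

# Rows that match none of the rules get None.
df['category'] = np.select(conditions, choices, default=None)
```

With `np.select` the earliest matching rule wins, whereas in a sequence of `.loc` assignments a later assignment overwrites an earlier one for rows that match several rules. Note that the extracted quantity is the string `'10'`, not the integer 10.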
MaxU - stand with Ukraine