Python - vectorizing regex search to classify

Question

I have an identifier function that goes through all the elements in a DataFrame column and assigns each one a category. The code as I have it now looks like this:

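import re

def fruit_replace(x):
    # Look for digits immediately followed by the word "quantity".
    fruit_quantity = re.search(r'(\d+)quantity', x)
    if 'apple' in x:
        return 'green'
    elif 'pear' in x:
        return 'green'
    elif 'cherry' in x:
        return 'red'
    elif 'banana' in x:
        return 'yellow'
    elif fruit_quantity is not None:
        # group(1) is just the digits, e.g. '10' for '10quantity'.
        return fruit_quantity.group(1)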

I apply this to the DataFrame column with a lambda function and assign the results to a new column. Unfortunately it is a bit complicated, because the fruit_quantity search returns part of the matched text instead of a fixed label like the other branches.
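For reference, the apply step described above might look like the following minimal sketch. The column names `fruit_type` and `category` come from the example frames in this question; the condensed `fruit_replace` is only illustrative, and returning just the digits via `group(1)` is an assumption based on the desired output.

```python
import re

import pandas as pd


def fruit_replace(x):
    # Condensed version of the if/elif chain, for illustration only.
    if 'apple' in x or 'pear' in x:
        return 'green'
    if 'cherry' in x:
        return 'red'
    if 'banana' in x:
        return 'yellow'
    fruit_quantity = re.search(r'(\d+)quantity', x)
    if fruit_quantity is not None:
        # Assumption: only the digits are wanted, not the whole match.
        return fruit_quantity.group(1)
    return None


df = pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'jerry 10quantity']})

# One Python-level function call per element; this is the part that is slow
# on millions of rows.
df['category'] = df['fruit_type'].apply(lambda x: fruit_replace(x))
```

Note that the extracted quantity comes back as the string '10', not the integer 10.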

The process should yield something like this:

Original DataFrame

pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'jerry 10quantity']})

Into this

pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'jerry 10quantity'],
              'category': ['green', 'red', 10]})

Can this code be written in a more Pythonic or pandas-idiomatic way, and ideally vectorized? I have to apply it to about 5 million rows, and it takes some time.

Many thanks!

Please provide a sample data set (5-7 rows) and desired data set. Please read [how to make good reproducible pandas examples](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — MaxU - stand with Ukraine, Feb 19 '17 at 20:03

score 1 · Accepted Answer · answered Feb 19 '17 at 20:17

You can use boolean indexing in conjunction with the str.contains() method:

df['category'] = np.nan

df.loc[df.fruit_type.str.contains(r'\b(?:apple|pear)\b'), 'category'] = 'green'
df.loc[df.fruit_type.str.contains(r'\b(?:cherry)\b'), 'category'] = 'red'
df.loc[df.fruit_type.str.contains(r'\b(?:banana)\b'), 'category'] = 'yellow'
df.loc[df['category'].isnull() & (df.fruit_type.str.contains(r'\d+q')), 'category'] = \
    df.fruit_type.str.extract(r'(\d+)q', expand=False)

Result:

In [270]: df
Out[270]:
         fruit_type category
0         big apple    green
1      small cherry      red
2  jerry 10quantity       10
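As a variation on the same idea, the sketch below builds the column with a chain of `Series.mask` calls instead of repeated `.loc` assignments. This is not the code from the answer, only one possible arrangement: the two extra sample rows and the helper names (`is_green`, `is_red`, `is_yellow`, `quantity`) are made up for illustration, and the full word `quantity` is assumed in the extraction pattern.

```python
import pandas as pd

df = pd.DataFrame({'fruit_type': ['big apple', 'small cherry', 'jerry 10quantity',
                                  'ripe banana', 'unknown']})

s = df['fruit_type']

# One boolean mask per label, computed over the whole column at once.
is_green = s.str.contains(r'\b(?:apple|pear)\b')
is_red = s.str.contains(r'\bcherry\b')
is_yellow = s.str.contains(r'\bbanana\b')

# Digits in front of "quantity"; missing where the pattern is absent.
quantity = s.str.extract(r'(\d+)quantity', expand=False)

# Start from the lowest-priority result and overwrite with higher-priority
# labels, so the final precedence matches the original if/elif order.
category = quantity
category = category.mask(is_yellow, 'yellow')
category = category.mask(is_red, 'red')
category = category.mask(is_green, 'green')

df['category'] = category
```

Rows that match nothing stay missing. The `\b` word boundaries are stricter than the original `'apple' in x` test: a value such as 'apples' would no longer count as a match, so drop the boundaries if substring matching is what you want.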
