0

This question is the follow up question to previous question How to extract only uppercase substring from pandas series?.

Instead of changing the old question, I decided to ask the new question.

My aim is to extract the aggregation method agg and feature name feat from a column named item.

Here is the question:


import numpy as np
import pandas as pd


df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})


regexp = (r'(?P<agg>) '     # agg is the word in uppercase (all other substring is lowercased)
         r'(?P<feat>), '   # 1. if there is no uppercase, whole string is feat
                           # 2. if there is uppercase the substring after example. is feat
                           # e.g. cat ==> cat
                           # cat.N_MOST_COMMON(example.ord)[2] ==> ord
                  
        )

df[['agg','feat']] = df.col.str.extract(regexp,expand=True)

# I am not sure how to build up regexp here.


print(df)

"""
Required output


                                item   agg               feat
0                                num                     num
1                               bool                     bool
2                                cat                     cat
3                 cat.COUNT(example)   COUNT                           # note: here feat is empty
4  cat.N_MOST_COMMON(example.ord)[2]   N_MOST_COMMON     ord
5             cat.FIRST(example.ord)   FIRST             ord
6             cat.FIRST(example.num)   FIRST             num
""";
BhishanPoudel
  • 15,974
  • 21
  • 108
  • 169

1 Answers1

1

For feat, since you already got the answer to agg in your other StackOverflow question, I think you can use the following to extract two different series based off two different patterns that are separated with | and then fillna() one series with another.

  1. ^([^A-Z]*$) should only return the full string if the full string is lowercase
  2. [^a-z].*example\.([a-z]+)\).*$ should only return strings after example. and before ) only if there is uppercase in the string prior to example.

df = pd.DataFrame({'item': ['num','bool', 'cat', 'cat.COUNT(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

s = df['item'].str.extract('^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[1]: 
                                item  feat
0                                num   num
1                               bool  bool
2                                cat   cat
3                 cat.COUNT(example)      
4  cat.N_MOST_COMMON(example.ord)[2]   ord
5             cat.FIRST(example.ord)   ord
6             cat.FIRST(example.num)   num

The above gives you the output you are looking for your sample data and holds to your conditions. However:

  1. What if there are UPPERCASE after example.? Current output would return ''

see example #2 below with some of the data changed according to above point:

df = pd.DataFrame({'item': ['num','cat.count(example.AAA)', 'cat.count(example.aaa)', 'cat.count(example)','cat.N_MOST_COMMON(example.ord)[2]','cat.FIRST(example.ord)','cat.FIRST(example.num)']})

s = df['item'].str.extract('^([^A-Z]*$)|[^a-z].*example\.([a-z]+)\).*$', expand=True)
df['feat'] = s[0].fillna(s[1]).fillna('')
df
Out[2]: 
                                item                    feat
0                                num                     num
1             cat.count(example.AAA)                        
2             cat.count(example.aaa)  cat.count(example.aaa)
3                 cat.count(example)      cat.count(example)
4  cat.N_MOST_COMMON(example.ord)[2]                     ord
5             cat.FIRST(example.ord)                     ord
6             cat.FIRST(example.num)                     num
David Erickson
  • 16,433
  • 2
  • 19
  • 35