0

I have a dataframe that contains a column containing a string value. I need to replace each value in that column with the results of a function. I'd like to do this without iterating over thousands of rows. The function takes a term and returns the approved new value of that term.

Example: getPreferredTerm('STAINED') would return 'DISCOLORED' so values of 'STAINED' in the P_TERM column would all be replaced by 'DISCOLORED'.

I'm struggling with using numpy to accomplish this.

df['P_TERM'] = getPreferredTerm(df['P_TERM'])

the getPreferredTerm function is as follows:

def getPreferredTerm(stresc): 
    # NOTE" obsData is a dataframe containing legacy terms in a 
    # column called 'STRESC' and preferred terms in a column 
    # named 'PTERM' so this function takes a legacy term as input 
    # and returns a preferred term 
    try:    
        df = obsData.loc[(obsData['STRESC'] == stresc)].iloc[0]['P_TERM'] 
        pterm = df 
    except Exception as e:  
        pterm = 'UNMAPPED' 
    return pterm

is it possible to vectorize this function if I pass it a series instead of a single value?

summersmd
  • 47
  • 2
  • 6

1 Answers1

0

Use pd.Series.apply to apply a function to each element in a series:

df['P_TERM'] = df['P_TERM'].apply(getPreferredTerm)

Note that such a calculation will not be vectorised, nor is it in-place. Pandas will perform a loop to apply your function to each value in the series. The resulting series is then assigned back to the dataframe.


If you want to use a dictionary to map your values, you can use pd.Series.map. This operation is optimized by Pandas.

d = {'STAINED': 'DISCOLORED'}
df['P_TERM'] = df['P_TERM'].map(d).fillna(df['P_TERM'])

An alternative is to use pd.Series.replace, but see this note on performance.


Vectorised method for your updated requirements:

s = obsData.drop_duplicates(subset=['STRESC']).set_index('STRESC')['P_TERM']
df['P_TERM'] = df['P_TERM'].map(s).fillna('UNMAPPED')
jpp
  • 159,742
  • 34
  • 281
  • 339
  • Will this be faster than iterating through each row? Is there a better way to do this? – summersmd Apr 18 '18 at 13:59
  • This will be no faster than iterating each row. If you want a vectorised method, you need to show us the code for `getPreferredTerm`, which may or may not be vectorisable. – jpp Apr 18 '18 at 14:00
  • def getObsPTerm (stresc): # NOTE" obsData is a dataframe containing legacy terms in a # column called 'STRESC' and preferred terms in a column # named 'PTERM' so this function takes a legacy term as input # and returns a preferred term try: df = obsData.loc[(obsData['STRESC'] == stresc)].iloc[0]['P_TERM'] pterm = df except Exception as e: pterm = 'UNMAPPED' return pterm – summersmd Apr 18 '18 at 14:06
  • @summersmd, Sorry I cannot read code in your comment. Please **[edit](https://stackoverflow.com/posts/49901415/edit)** your question if you need to provide more data / code. – jpp Apr 18 '18 at 14:07
  • @summersmd, see update. You haven't provided data with expected output, so I cannot test to see if this is what you want. – jpp Apr 18 '18 at 14:40
  • @jp... Thanks for the assist. Your suggestion works great (s = obsData.drop_duplicates(subset=['STRESC']).set_index('STRESC')['P_TERM'] df['P_TERM'] = df['P_TERM'].map(s).fillna('UNMAPPED')) I have a new wrinkle. What if my def getPreferredTerm(stresc) function actually required 2 inputs (stresc, category)? If the preferred term was looked up using stresc and a category, example: STRESC CATEGORY P_TERM Stained Hair Discolored – summersmd Apr 18 '18 at 16:48