7

I have a pandas dataframe that looks like:

d = {'some_col' : ['A', 'B', 'C', 'D', 'E'],
     'alert_status' : [1, 2, 0, 0, 5]}
df = pd.DataFrame(d)

Quite a few tasks at my job require the same tasks in pandas. I'm beginning to write standardized functions that will take a dataframe as a parameter and return something. Here's a simple one:

def alert_read_text(df, alert_status=None):
    if (alert_status is None):
        print 'Warning: A column name with the alerts must be specified'
    alert_read_criteria = df[alert_status] >= 1
    df[alert_status].loc[alert_read_criteria] = 1
    alert_status_dict = {0 : 'Not Read',
                         1 : 'Read'}
    df[alert_status] = df[alert_status].map(alert_status_dict)
    return df[alert_status]

I'm looking to have the function return a series. This way, one could add a column to an existing data frame:

df['alert_status_text'] = alert_read_text(df, alert_status='alert_status')

However, currently, this function will correctly return a series, but also modifies the existing column. How do you make it so the original column passed in does not get modified?

DataSwede
  • 5,251
  • 10
  • 40
  • 66

2 Answers2

6

As you've discovered the passed in dataframe will be modified as params are passed by reference, this is true in python and nothing to do with pandas as such.

So if you don't want to modify the passed df then take a copy:

def alert_read_text(df, alert_status=None):
    if (alert_status is None):
        print 'Warning: A column name with the alerts must be specified'
    copy = df.copy()
    alert_read_criteria = copy[alert_status] >= 1
    copy[alert_status].loc[alert_read_criteria] = 1
    alert_status_dict = {0 : 'Not Read',
                         1 : 'Read'}
    copy[alert_status] = copy[alert_status].map(alert_status_dict)
    return copy[alert_status]

Also see related: pandas dataframe, copy by value

Community
  • 1
  • 1
EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Thanks! That solved it. Is there a standardized variable name that is common to use as the copy within functions like this? Meaning, pandas dataframes are usually abbreviated as df, etc. Do people typically name it as 'copy'? Or is it typically whatever you come up with? – DataSwede Jul 31 '14 at 22:21
  • @DataSwede that was just a quick hack example, you could call it whatever you want to be honest, `tmp` would also do – EdChum Jul 31 '14 at 22:22
0

You don't need to set any value on your DataFrame on your example.

def alert_read_text(df, alert_status):
    alert_read_criteria = df[alert_status] >= 1
    alert_status_dict = {False : 'Not Read',
                     True : 'Read'}
    return alert_read_criteria.map(alert_status_dict)

Since the alert_read_criteria Series has the same index as df, you can still do df['alert_status_text'] = alert_read_text(df, alert_status='alert_status') afterwards.

From my experience, assigning columns to a DataFrame passed as parameter while not intending to return such DataFrame is generally a bad pattern. You might be hiding side-effects of the function as well.

VaM
  • 180
  • 2
  • 13