Functions of pandas data frames and side effects

Question

I want to write a function that takes a Pandas data frame as input and returns only the rows whose average is greater than some specified threshold. The function works, but it has the side effect of changing the input, which I don't want.

def Remove_Low_Average(df, sample_names, average_threshold=30):
    data_frame = df
    data_frame['Mean'] = np.mean(data_frame[sample_names], axis=1)
    data_frame = data_frame[data_frame.Mean > average_threshold]
    return data_frame.reset_index(drop=True)

Example:

In [7]: junk_data = DataFrame(np.random.randn(5,5), columns=['a', 'b', 'c', 'd', 'e'])
In [8]: Remove_Low_Average(junk_data, ['a', 'b', 'c'], average_threshold=0)
In [9]: junk_data.columns
Out[9]: Index([u'a', u'b', u'c', u'd', u'e', u'Mean'], dtype='object')

So junk_data now has 'Mean' in its columns even though the function never assigns it to df directly. I realize I could do this more simply, but it illustrates a problem I've been running into regularly, and I can't figure out why it happens. I figure this must be a well-known thing, but I don't know how to stop the side effect from happening.

EDIT: EdChum's link below answers the question.

What did you expect this line to do: `data_frame['Mean'] = np.mean(data_frame[sample_names], axis=1)`? if you just wanted to calculate a local variable then why not just `mean = np.mean(data_frame[sample_names], axis=1)`? — EdChum, Jul 15 '14 at 20:46
@EdChum This is a minimal example that illustrates a problem that I am having with a larger function. I *expect* that because I am addressing the variable data_frame (which is not passed to the function) that it would leave df (which is) unchanged. Clearly, that is not the case. I am wondering why, what this is called, and how to work around it, and where in the Pandas docs I might read more about it. — SmearingMap, Jul 15 '14 at 20:50
Well, you have to understand that you are most likely taking a view of the dataframe, so any form of assignment will affect the original dataframe. In your example it seems completely unnecessary to create a new column for this calculation. — EdChum, Jul 15 '14 at 20:52
@EdChum I agree, but in another function I need to take a data frame and compute various statistics and append those to the output, but I can't do that without changing the original it seems. This illustrates the kind of things I need to do. — SmearingMap, Jul 15 '14 at 20:54
@EdChum What do you mean by taking a view of the data frame and why does it mean that any form of assigning will affect the original? This sounds like exactly what I need to know about. — SmearingMap, Jul 15 '14 at 20:56
It depends on what you're doing but your example doesn't represent your problem as it's not a problem in that example, you could just return `df[np.mean(df[sample_names],axis=1) > 30]` — EdChum, Jul 15 '14 at 20:56
see http://stackoverflow.com/questions/13419822/pandas-dataframe-copy-by-value so basically if you want to avoid modifying the original then perform a deep copy by calling `.copy()` — EdChum, Jul 15 '14 at 20:58
@EdChum Thanks for the link! That's exactly what I needed to see. I didn't know about the pass-by-reference/pass-by-value distinction. Thanks for helping me learn something new today. — SmearingMap, Jul 15 '14 at 21:08

score 0 · Accepted Answer · edited May 23 '17 at 11:57

@EdChum answered this in the comments:

See http://stackoverflow.com/questions/13419822/pandas-dataframe-copy-by-value. Basically, if you want to avoid modifying the original, perform a deep copy by calling `.copy()`.
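
As a rough sketch of what that looks like for the function in the question (the lowercase name `remove_low_average` and the `result` variable are only illustrative): `data_frame = df` binds a second name to the same object, whereas `df.copy()` gives the function its own frame to modify.

```python
import numpy as np
import pandas as pd


def remove_low_average(df, sample_names, average_threshold=30):
    # `data_frame = df` would only bind a second name to the same object;
    # .copy() creates an independent DataFrame, so the caller's frame is untouched.
    data_frame = df.copy()
    data_frame['Mean'] = data_frame[sample_names].mean(axis=1)
    data_frame = data_frame[data_frame['Mean'] > average_threshold]
    return data_frame.reset_index(drop=True)


junk_data = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'])
result = remove_low_average(junk_data, ['a', 'b', 'c'], average_threshold=0)

print(list(junk_data.columns))  # 'Mean' does not appear here
print(list(result.columns))     # 'Mean' appears only in the returned frame
```

The copy costs extra memory for large frames, so this is probably best kept for functions that really do need to add or overwrite columns.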

answered Jul 15 '14 at 21:06

SmearingMap

score 0 · Answer 2 · answered Jul 15 '14 at 23:04

You don't need to copy the old dataframe, just don't assign a new column :)

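def remove_low_average(df, sample_names, average_threshold=30):
    # Compute the row means as a local Series instead of adding a column to df.
    mean = df[sample_names].mean(axis=1)
    # Boolean indexing with .loc returns a new frame (.ix was removed in pandas 1.0).
    return df.loc[mean > average_threshold]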

# then use it as:
df = remove_low_average(df, ['a', 'b'])
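
For the case mentioned in the comments, where the computed statistics do need to appear in the output, one possible approach is `DataFrame.assign`, which returns a new frame with the extra column and leaves the input alone. This is only a sketch: the helper name `add_row_mean` and the sample values are made up for illustration.

```python
import pandas as pd


def add_row_mean(df, sample_names):
    # assign() returns a new DataFrame with the extra column;
    # the frame passed in is not modified.
    return df.assign(Mean=df[sample_names].mean(axis=1))


# Hypothetical sample data, chosen so the row means are easy to check by hand.
original = pd.DataFrame({'a': [1.0, 40.0], 'b': [3.0, 50.0], 'c': [0.0, 0.0]})
with_mean = add_row_mean(original, ['a', 'b'])
filtered = with_mean[with_mean['Mean'] > 30].reset_index(drop=True)

print(list(original.columns))  # still ['a', 'b', 'c']
print(filtered)
```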
