0

I want to write a function that takes a Pandas data frame as input and returns only the rows whose average is greater than some specified threshold. The function works, but it has the side effect of changing the input, which I don't want.

def Remove_Low_Average(df, sample_names, average_threshold=30):
    data_frame = df
    data_frame['Mean'] = np.mean(data_frame[sample_names], axis=1)
    data_frame = data_frame[data_frame.Mean > average_threshold]
    return data_frame.reset_index(drop=True)

Example:

In [7]: junk_data = DataFrame(np.random.randn(5,5), columns=['a', 'b', 'c', 'd', 'e'])
In [8]: Remove_Low_Average(junk_data, ['a', 'b', 'c'], average_threshold=0)
In [9]: junk_data.columns
Out[9]: Index([u'a', u'b', u'c', u'd', u'e', u'Mean'], dtype='object')

So junk_data now has 'Mean' in its columns, even though the function only ever assigned that column to data_frame, never to df. I realize I could do this in a simpler manner, but it illustrates a problem I've been running into regularly, and I can't figure out why it happens. I figure this has to be a well-known thing, but I don't know how to stop the side effect.

EDIT: EdChum's link below answers the question.

SmearingMap
  • What did you expect this line to do: `data_frame['Mean'] = np.mean(data_frame[sample_names], axis=1)`? if you just wanted to calculate a local variable then why not just `mean = np.mean(data_frame[sample_names], axis=1)`? – EdChum Jul 15 '14 at 20:46
  • @EdChum This is a minimal example that illustrates a problem that I am having with a larger function. I *expect* that because I am addressing the variable data_frame (which is not passed to the function) that it would leave df (which is) unchanged. Clearly, that is not the case. I am wondering why, what this is called, and how to work around it, and where in the Pandas docs I might read more about it. – SmearingMap Jul 15 '14 at 20:50
  • Well, you have to understand that you are most likely taking a view of the dataframe, so any form of assigning will affect the original dataframe. In your example it seems completely unnecessary to create a new column for this calculation. – EdChum Jul 15 '14 at 20:52
  • @EdChum I agree, but in another function I need to take a data frame and compute various statistics and append those to the output, but I can't do that without changing the original it seems. This illustrates the kind of things I need to do. – SmearingMap Jul 15 '14 at 20:54
  • @EdChum What do you mean by taking a view of the data frame and why does it mean that any form of assigning will affect the original? This sounds like exactly what I need to know about. – SmearingMap Jul 15 '14 at 20:56
  • It depends on what you're doing but your example doesn't represent your problem as it's not a problem in that example, you could just return `df[np.mean(df[sample_names],axis=1) > 30]` – EdChum Jul 15 '14 at 20:56
  • see http://stackoverflow.com/questions/13419822/pandas-dataframe-copy-by-value so basically if you want to avoid modifying the original then perform a deep copy by calling `.copy()` – EdChum Jul 15 '14 at 20:58
  • @EdChum Thanks for the link! That's exactly what I needed to see. I didn't know about the pass-by-reference/pass-by-value distinction. Thanks for helping me learn something new today. – SmearingMap Jul 15 '14 at 21:08

2 Answers

0

@EdChum answered this in the comments:

See http://stackoverflow.com/questions/13419822/pandas-dataframe-copy-by-value. Basically, if you want to avoid modifying the original, make a deep copy by calling `.copy()`.
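A minimal sketch of that fix, assuming pandas is imported as `pd` and NumPy as `np`. The name `remove_low_average_copy` is hypothetical and not from the question. In the original function, `data_frame = df` only binds a second name to the same object, so the column assignment lands on the caller's frame; calling `df.copy()` gives the function its own object to modify.

```python
import numpy as np
import pandas as pd


def remove_low_average_copy(df, sample_names, average_threshold=30):
    # Hypothetical variant of the question's function.
    # Work on an independent copy so the caller's frame is never modified.
    data_frame = df.copy()
    data_frame['Mean'] = data_frame[sample_names].mean(axis=1)
    data_frame = data_frame[data_frame['Mean'] > average_threshold]
    return data_frame.reset_index(drop=True)


junk_data = pd.DataFrame(np.random.randn(5, 5), columns=['a', 'b', 'c', 'd', 'e'])
result = remove_low_average_copy(junk_data, ['a', 'b', 'c'], average_threshold=0)

print(list(junk_data.columns))  # the original keeps its five columns
print(list(result.columns))     # the returned frame also carries 'Mean'
```

The copy costs extra memory for large frames, so this is worth it mainly when the function really needs to add columns before returning.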

Community
SmearingMap
0

You don't need to copy the old dataframe, just don't assign a new column :)

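def remove_low_average(df, sample_names, average_threshold=30):
    # Row-wise mean of the selected columns, kept as a local Series
    mean = df[sample_names].mean(axis=1)
    # .ix was removed in pandas 1.0; .loc with a boolean mask returns a new
    # DataFrame and leaves df unmodified
    return df.loc[mean > average_threshold]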

# then use it as:
df = remove_low_average(df, ['a', 'b'])
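
For the case raised in the comments, where the computed statistics do need to appear in the output, one option is `DataFrame.assign`, which returns a new frame instead of writing into the one passed in. This is only a sketch: the helper name `add_row_stats` and the `Std` column are made up for illustration.

```python
import numpy as np
import pandas as pd


def add_row_stats(df, sample_names):
    # Hypothetical helper: append row-wise statistics without touching df.
    # DataFrame.assign builds and returns a new frame.
    samples = df[sample_names]
    return df.assign(Mean=samples.mean(axis=1), Std=samples.std(axis=1))


df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 6.0]})
with_stats = add_row_stats(df, ['a', 'b'])

print(list(df.columns))          # the input still has only 'a' and 'b'
print(list(with_stats.columns))  # the result has 'Mean' and 'Std' appended
```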
U2EF1