
I would like to apply scipy.stats.linregress within a pandas groupby. I have looked through the documentation, but all I could find was how to apply something to a single column, like

grouped.agg(np.sum)

or a function like

grouped.agg({'D': lambda x: np.std(x, ddof=1)})

But how do I apply linregress, which takes TWO inputs, X and Y?

DSM
user1911866

1 Answer


The linregress function, like many other scipy/numpy functions, accepts "array-like" X and Y; both Series and DataFrame can qualify.

For example:

import numpy as np
import pandas as pd
from scipy.stats import linregress
X = pd.Series(np.arange(10))
Y = pd.Series(np.arange(10))

In [4]: linregress(X, Y)
Out[4]: (1.0, 0.0, 1.0, 4.3749999999999517e-80, 0.0)

In fact, being able to use scipy (and numpy) functions is one of pandas' killer features!

So if you have a DataFrame you can use linregress on its columns (which are Series):

linregress(df['col_X'], df['col_Y'])

and with a groupby you can similarly apply it to each group:

grouped.apply(lambda x: linregress(x['col_X'], x['col_Y']))
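
To get the regression values back as one row per group, something like the following might work. This is only a minimal sketch: the data, the "group" column, and the helper name fit are made up for illustration, and it assumes a scipy version whose linregress result exposes .slope, .intercept and .rvalue attributes.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Hypothetical data: two groups, each with an exact linear relationship.
df = pd.DataFrame({
    "group": ["a"] * 5 + ["b"] * 5,
    "col_X": np.tile(np.arange(5.0), 2),
})
df["col_Y"] = np.where(
    df["group"] == "a",
    2.0 * df["col_X"] + 1.0,   # group a: slope 2, intercept 1
    -1.0 * df["col_X"] + 3.0,  # group b: slope -1, intercept 3
)

def fit(g):
    # g is the sub-DataFrame for one group; both columns go into linregress.
    res = linregress(g["col_X"], g["col_Y"])
    # Returning a Series makes apply build a DataFrame with one row per group.
    return pd.Series({"slope": res.slope,
                      "intercept": res.intercept,
                      "rvalue": res.rvalue})

# Selecting the two columns first keeps the grouping column out of each group.
result = df.groupby("group")[["col_X", "col_Y"]].apply(fit)
print(result)
```

Returning a pd.Series from the helper is a design choice rather than a requirement; returning the raw linregress result, as in the one-liner above, gives a Series of result objects instead.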
Andy Hayden
  • Thanks Andy, yes it can accept it. The question is how to do it by group. For example, I have a datetime that I have grouped into year and month. I want to do the linear regression for each of the groups, then return the values from the linear regression. Also, I have a DataFrame, so how can I apply that using two columns in the DF? Thanks, Jason – user1911866 Feb 10 '13 at 09:37
  • @user1911866 also, see [this question and its answer](http://stackoverflow.com/questions/12410438/how-to-use-pandas-groupby-apply-without-adding-an-extra-index). – Andy Hayden Feb 10 '13 at 19:31
  • Thanks once more. How do I exclude NaN values from the linregress calculation? Is there a way of masking values in the grouped calculations? Best wishes, Jason. – user1911866 Feb 11 '13 at 15:20
  • @user1911866 This reminds me a lot of [this answer](http://stackoverflow.com/questions/13930367/interpolating-time-series-in-pandas-using-cubic-spline/13931877#13931877), drop the na first then do the calculation. :) – Andy Hayden Feb 11 '13 at 15:27
  • Ahh. Yes of course. I was wondering if the 'mask' feature in Pandas would work and was thinking along those lines. Excellent. – user1911866 Feb 12 '13 at 01:03
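
On the NaN point raised in the comments, "drop the NaNs first, then do the calculation" could look roughly like this inside a groupby. It is a sketch under assumptions: the data, the "group" column, and the helper name fit_without_nan are hypothetical, and dropna(subset=...) is just one way of removing rows where either input is missing.

```python
import numpy as np
import pandas as pd
from scipy.stats import linregress

# Hypothetical data with one missing Y value in each group.
df = pd.DataFrame({
    "group": ["a"] * 4 + ["b"] * 4,
    "col_X": [0.0, 1.0, 2.0, 3.0, 0.0, 1.0, 2.0, 3.0],
    "col_Y": [1.0, np.nan, 5.0, 7.0, 3.0, 2.0, np.nan, 0.0],
})

def fit_without_nan(g):
    # Drop rows where either regression input is NaN, so X and Y stay aligned.
    clean = g.dropna(subset=["col_X", "col_Y"])
    res = linregress(clean["col_X"], clean["col_Y"])
    return pd.Series({"slope": res.slope,
                      "intercept": res.intercept,
                      "n": len(clean)})

nan_result = df.groupby("group")[["col_X", "col_Y"]].apply(fit_without_nan)
print(nan_result)
```

Dropping rows inside the helper, rather than on the whole frame beforehand, keeps the original DataFrame untouched and makes the number of points actually used visible per group.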