4

I need to do an apply on a dataframe using inputs from multiple rows. As a simple example, I can do the following if all the inputs are from a single row:

df['c'] = df[['a','b']].apply(lambda x: awesome stuff, axis=1) 
# or 
df['d'] = df[['b','c']].shift(1).apply(...) # to get the values from the previous row

However, if I need 'a' from the current row, and 'b' from the previous row, is there a way to do that with apply? I could add a new 'bshift' column and then just use df[['a','bshift']] but it seems there must be a more direct way.
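For concreteness, the 'bshift' workaround looks like this (toy data; the lambda is just a stand-in for the real logic):

```python
import pandas as pd

df = pd.DataFrame({'a': [1., 2., 3.], 'b': [10., 20., 30.]})

df['bshift'] = df['b'].shift(1)                  # previous row's 'b'
df['c'] = df[['a', 'bshift']].apply(
    lambda r: r['a'] + r['bshift'], axis=1)      # placeholder for the real logic
del df['bshift']                                 # drop the helper column
```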

Related but separate: when accessing a specific value in the df, is there a way to combine label-based indexing with an integer offset? E.g. I know the label of the current row but need the row before. Something like df.at['labelIknow'-1, 'a'] (which of course doesn't work). This is for when I'm forced to iterate through rows. Thanks in advance.

Edit: Some info on what I'm doing. I have a pandas store containing tables of OHLC bars (one table per security). When backtesting, I currently pull the full date range I need for a security into memory, then resample it into a frequency that makes sense for the test at hand. Then I do some vectorized operations for things like trade entry signals. Finally I loop over the data from start to finish doing the actual backtest, e.g. checking for trade entry/exit, drawdown etc. - this looping part is the part I'm trying to speed up.

fantabolous
  • I don't think such a thing is possible. What are you trying to achieve anyways? Potentially, `apply()` is the wrong approach to begin with. – FooBar Aug 16 '14 at 13:54
  • I currently iterate to do back testing and I'm trying to get my code to run faster. Each row (point in time) depends on what happened before, but also depends on what's happening at the current time. There's a bunch of stuff (~50 lines of code currently) happening in each iteration but I thought I'd try to get as much of it into applies as possible and see if it helps. – fantabolous Aug 16 '14 at 13:59
  • Row-wise `apply()` doesn't magically make your code faster. If there is an operation that you could use to pre-calculate some of your values column-wise, you should do so beforehand. (As you already do; shifting columns is one of them.) – FooBar Aug 16 '14 at 14:07
  • Point taken about apply not being magically faster. Come to think of it I suppose doing multiple applies could well end up being slower than a single large for loop. The first answer to http://stackoverflow.com/questions/24870953/does-iterrows-have-performance-issues/24871316 made me want to try it. – fantabolous Aug 16 '14 at 14:24
  • Added a little background. Considering your point about apply and speed, I think what I'll do is try to vectorize as much as I can, and see what I'm left with, and then come back with more questions as I have them. If I tried to post all my code as it is it'd just be overload. Appreciate your advice. – fantabolous Aug 16 '14 at 14:38

2 Answers

3

This should directly answer your question and let you use apply, although I'm not sure it's ultimately any better than a two-line solution. It does avoid creating extra variables at least.

df['c'] = pd.concat([df['a'], df['a'].shift()], axis=1).apply(np.mean, axis=1)

That will put the mean of 'a' values from the current and previous rows into 'c', for example.

This isn't as general, but for simpler cases you can do something like this (continuing the mean example):

df['c'] = (df['a'] + df['a'].shift()) / 2

That is about 10x faster than the concat() method on my tiny example dataset. I imagine that's as fast as you could do it, if you can code it in that style.
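For this particular two-row mean, a rolling window (my addition, not part of the original answer) produces the same values and also stays in fast vectorized code:

```python
import pandas as pd

df = pd.DataFrame({'a': [1., 2., 3., 4.]})

manual = (df['a'] + df['a'].shift(1)) / 2   # the hand-shifted version
rolled = df['a'].rolling(2).mean()          # same values via a rolling window
```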

You could also look into reshaping the data with stack() and hierarchical indexing. That would be a way to get all your variables into the same row, but I think it will likely be more complicated than the concat method or just creating intermediate variables via shift().
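As a rough sketch of that idea (my own example, not from the original answer): stacking the current and shifted copies of the frame under a hierarchical column index puts both time steps into every row, so a single row-wise apply can see them all:

```python
import pandas as pd

df = pd.DataFrame({'a': [1., 2., 3., 4.], 'b': [10., 20., 30., 40.]})

# Columns become a MultiIndex of (lag, variable): 'cur' holds the current
# row, 'prev' the shifted one, so apply() sees both in a single row.
wide = pd.concat({'cur': df, 'prev': df.shift(1)}, axis=1)
result = wide.apply(lambda row: row[('cur', 'a')] + row[('prev', 'b')], axis=1)
```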

JohnE
  • Thanks. I timed your first method vs adding a shifted column and then deleting it after I finished the apply, and interestingly they were virtually identical (and not very fast) for any df size I tried, so I guess the time is all spent in the apply itself. Indeed the 2nd method (vectorized) is much faster and I'll try to do as much of that as possible but unfortunately I don't think it's possible to vectorize all of it. – fantabolous Aug 17 '14 at 01:53
  • @fantabolous It's not just the apply; it's also somewhat expensive to concatenate/merge or create a new variable. In my example data, about half the time cost is due to concat and half due to apply. Just replace the apply(mean) with mean() and you'll see it's still slower. – JohnE Aug 17 '14 at 14:14
  • interesting that you should bring that up.. I tried with just `.mean(axis=1)` and on my 10000x2 df of randn, it took about 2.5ms, vs 480ms using `.apply(np.mean,axis=1)`. In both cases that includes adding a new 'bshift' column and then deleting it again. Same thing if I use your .concat method: directly doing `.mean` is far faster. – fantabolous Aug 18 '14 at 03:05
  • I don't really understand the pandas internals, but Jeff has explained in other answers that mean() stays in cythonized space while apply() falls back into regular Python space, so always avoid apply() if possible. – JohnE Aug 18 '14 at 12:13
1

For the first part, I don't think such a thing is possible. If you update your question with what you actually want to achieve, I can update this answer.

Also, looking at the second part, your data structure seems to rely an awful lot on the order of rows. This is typically not how you want to manage your databases. Again, if you tell us what your overall goal is, we may be able to hint you towards a solution (and potentially a better way to structure the database).

Anyhow, one way to get the row before, if you know a given index label, is to do:

df.loc[:'labelYouKnow'].iloc[-2]

Note that this is not the optimal thing to do efficiency-wise, so you may want to improve your db structure in order to prevent the need for doing such things.
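Another option (my addition, not part of the original answer) is to translate the known label into an integer position with `Index.get_loc` and step back one row from there:

```python
import pandas as pd

df = pd.DataFrame({'a': [1., 2., 3.]}, index=['x', 'y', 'z'])

pos = df.index.get_loc('y')        # integer position of the label you know
prev_a = df['a'].iloc[pos - 1]     # 'a' from the row before it
```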

FooBar
  • The index is a sorted and resampled (read regularly spaced) timeseries, so the previous bar is always a constant timedelta in the past. For the 2nd part I'm currently adding a column with the data shifted (I could also calculate index by offset, e.g. `.ix[currenttime - DateOffset(seconds=1), 'a']`, but that seemed inefficient) but was just hoping there was some efficient pandas selection magic that I wasn't aware of. Perhaps I was reaching :) – fantabolous Aug 16 '14 at 14:20