4

Assume I have a DataFrame of the following form where the first column is a random number, and the other columns will be based on the value in the previous column.

enter image description here

For ease of use, let's say I want each number to be the previous one squared. So it would look like the below.

enter image description here

I know I can write a pretty simple loop to do this, but I also know looping is not usually the most efficient in python/pandas. How could this be done with apply() or rolling_apply()? Or, otherwise be done more efficiently?

My (failed) attempts below:

In [12]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})

In [13]: a
Out[13]: 
   0  1  2  3
0  1  0  0  0
1  2  0  0  0
2  3  0  0  0
3  4  0  0  0
4  5  0  0  0

In [14]: a = a.apply(lambda x: x**2)

In [15]: a
Out[15]: 
    0  1  2  3
0   1  0  0  0
1   4  0  0  0
2   9  0  0  0
3  16  0  0  0
4  25  0  0  0


In [16]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})

In [17]: pandas.rolling_apply(a,1,lambda x: x**2)
C:\WinPython64bit\python-3.5.2.amd64\lib\site-packages\spyderlib\widgets\externalshell\start_ipython_kernel.py:1: FutureWarning: pd.rolling_apply is deprecated for DataFrame and will be removed in a future version, replace with 
        DataFrame.rolling(center=False,window=1).apply(args=<tuple>,kwargs=<dict>,func=<function>)
  # -*- coding: utf-8 -*-
Out[17]: 
      0    1    2    3
0   1.0  0.0  0.0  0.0
1   4.0  0.0  0.0  0.0
2   9.0  0.0  0.0  0.0
3  16.0  0.0  0.0  0.0
4  25.0  0.0  0.0  0.0

In [18]: a = pandas.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})

In [19]: a = a[:-1]**2

In [20]: a
Out[20]: 
    0  1  2  3
0   1  0  0  0
1   4  0  0  0
2   9  0  0  0
3  16  0  0  0

In [21]: 

So, my issue is mostly how to refer to the previous column value in my DataFrame calculations.

Kyle
  • 2,543
  • 2
  • 16
  • 31

4 Answers4

4

What you're describing is a recurrence relation, and I don't think there is currently any non-loop way to do that. Things like apply and rolling_apply still rely on having all the needed data available before they begin, and outputting all the result data at once at the end. That is, they don't allow you to compute the next value using earlier values of the same series. See this question and this one as well as this pandas issue.

In practical terms, for your example, you only have three columns you want to fill in, so doing a three-pass loop (as shown in some of the other answers) will probably not be a major performance hit.

Community
  • 1
  • 1
BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • I appreciate your response. My actual DataFrame is much larger and contains some logic. This is just a simplified example to learn how best to approach this from an efficiency standpoint. The links are helpful. – Kyle Mar 23 '17 at 18:34
3
a[1] = a[0].apply(lambda x: x**2)
a[2] = a[1].apply(lambda x: x**2)
a[3] = a[2].apply(lambda x: x**2)

will give you

    0   1   2   3
0   1   1   1   1
1   2   4   16  256
2   3   9   81  6561
3   4   16  256 65536
4   5   25  625 390625
Vaishali
  • 37,545
  • 5
  • 58
  • 86
3

Unfortunately there isn't a way of doing this with no loops, as far as I know. However, you don't have to loop through every value, just each column. You can just call apply on the previous column and set the next one to the returned value:

a = pd.DataFrame({0:[1,2,3,4,5],1:0,2:0,3:0})

for i in range(3):
    a[i+1] = a[i].apply(lambda x: x**2)
Kewl
  • 3,327
  • 5
  • 26
  • 45
  • I like this. Nice and simple. my previous attempt with a full loop (and my full data source, with additional logic, and applying scipy.fv()) took 2 mins, this approach took under a second. I feel really stupid :). – Kyle Mar 23 '17 at 19:14
3

In this special case, we know this about the columns

  • 0 will be what ever is there to the power of 1
  • 1 will be what ever is in column 0 to the power of 2
  • 2 will be what ever is in column 1 to the power of 2...
    • or will be what ever is in column 0 to the power of 4
  • 3 will be what ever is in column 2 to the power of 2...
    • or will be what ever is in column 1 to the power of 4...
    • or will be what ever is in column 0 to the power of 8

So we can indeed vectorize your example with

np.power(df.values[:, [0]], np.power(2, np.arange(4)))

array([[     1,      1,      1,      1],
       [     2,      4,     16,    256],
       [     3,      9,     81,   6561],
       [     4,     16,    256,  65536],
       [     5,     25,    625, 390625]])

Wrap this in a pretty dataframe

pd.DataFrame(
    np.power(df.values[:, [0]], np.power(2, np.arange(4))),
    df.index, df.columns)

   0   1    2       3
0  1   1    1       1
1  2   4   16     256
2  3   9   81    6561
3  4  16  256   65536
4  5  25  625  390625
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • I really like this approach, and this certainly does answer my question. I accepted another answer however, because my question was generalized. I should have figured there was a more specialized way to achieve my specific question. – Kyle Mar 23 '17 at 19:11
  • @Kyle a few points. One: Know that you can accept another answer instead if you find it is more appropriate than the current accepted answer. In fact it is encouraged. I've had more than a few of my answers un-accepted and I'm ok with that. Two: sometimes more difficult questions require a bit more time to answer. If you want a quality answer to a challenging question, be patient. There is no need to rush accepting the answer. You can also be forthright to those who've answered already in that you are waiting. Three: I'm less concerned with 15 rep points and more so about what's right. – piRSquared Mar 23 '17 at 19:15
  • 1
    @Kyle all that said. I'm totally fine with Kewl's answer. Others will vote this up or down or not as they see fit. :-) – piRSquared Mar 23 '17 at 19:16