
I am currently converting from SAS to Python pandas. One question I have: does pandas have a function like SAS's RETAIN, so that I can dynamically reference the last record? In the following code, I have to manually loop through each line and reference the previous record. It seems pretty slow compared to the equivalent SAS program. Is there any way to make this more efficient in pandas? Thank you.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [0, 0, 1, 0]})
df['C'] = np.nan
df['lag_C'] = np.nan
for row in df.index:
    if row == df.index[0]:
        # first row: C is 1 if A is 0, else 0
        df.loc[row, 'C'] = (df.loc[row, 'A'] == 0) + 0
    else:
        if df.loc[row, 'B'] == 1:
            df.loc[row, 'C'] = 1                         # B == 1 resets the counter
        elif df.loc[row, 'lag_C'] == 0:
            df.loc[row, 'C'] = 0                         # counter stays at zero
        elif df.loc[row, 'lag_C'] != 0:
            df.loc[row, 'C'] = df.loc[row, 'lag_C'] + 1  # otherwise keep counting
    if row != df.index[-1]:
        # carry C forward to the next row, like RETAIN in SAS
        df.loc[row + 1, 'lag_C'] = df.loc[row, 'C']
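
For reference, on this sample input the loop yields C = [0, 0, 1, 2]: C stays at 0 until the first B == 1, then counts rows since the last reset.

print(df)
#    A  B  C  lag_C
# 0  1  0  0    NaN
# 1  1  0  0      0
# 2  1  1  1      0
# 3  1  0  2      1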
  • You should vectorize or write in cython/numba. IMO your code is very difficult to reason about, but I suspect there's a readable/efficient solution... – Andy Hayden Nov 06 '15 at 17:15
  • Can you state in simple terms what you want to achieve, i.e. what is the expected result? – miraculixx Nov 06 '15 at 18:38

1 Answer


It is a complicated algorithm, but I will try a vectorized approach.
If I understand it correctly, a cumulative sum can be used, as in this question. The last column, lag_C, is just column C shifted down by one row.

But my algorithm can't be used for the first rows of df, because only those rows are counted from the first value of column A (and sometimes column B). So I created a helper column D, where those special rows are set aside and later copied back to the output column C if the conditions are True.

I changed the input data to test the problematic first rows. I tested all three possibilities for the first rows of column B together with the first row of column A.

My input conditions are:
Columns A and B contain only 1 or 0. Columns C and lag_C are helper columns containing only NaN.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
df1 = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})

# cumulative sum of column B
df1['C'] = df1['B'].cumsum()
df1['lag_C'] = 1
# the first 'group' (minimum value of the cumsum) is the problematic one;
# copy it to column D for later use
df1.loc[df1['C'] == df1['C'].min(), 'D'] = df1['B']
# cumulative sums within each group go to column C
df1['C'] = df1.groupby(['C'])['lag_C'].cumsum()
# correct the problematic states in column C, using the values saved in D
if df1['A'].loc[0] == 1:
    df1.loc[df1['D'].notnull(), 'C'] = df1['D']
if (df1['A'].loc[0] == 1) and (df1['B'].loc[0] == 1):
    df1.loc[df1['D'].notnull(), 'C'] = 0
del df1['D']
# lag_C is column C shifted down by one row
df1['lag_C'] = df1['C'].shift(1)
print(df1)
#    A  B  C  lag_C
#0   1  0  0    NaN
#1   1  0  0      0
#2   1  1  1      0
#3   1  1  1      1
#4   1  0  2      1
#5   0  0  3      2
#6   0  0  4      3
#7   1  1  1      4
#8   1  0  2      1
#9   0  1  1      2
#10  0  0  2      1
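
The same reset-counter pattern can also be written with groupby and cumcount. A minimal sketch, assuming (as in the sample data above) that the first value of B is 0, so only the A-based correction for the first group is needed:

import pandas as pd

df2 = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0],
                    'B': [0,0,1,1,0,0,0,1,0,1,0]})

grp = df2['B'].cumsum()                     # each B == 1 starts a new group
df2['C'] = df2.groupby(grp).cumcount() + 1  # 1 on the B == 1 row, then 2, 3, ...
# the first group has no B == 1 start; seed it from the first-row rule on A
if df2['A'].iloc[0] == 1:
    df2.loc[grp == 0, 'C'] = 0
df2['lag_C'] = df2['C'].shift(1)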
jezrael
  • Thank you very much, jezrael. Your approach of vectorizing the logic can definitely improve the speed. The follow-up question is: if the logic gets so complicated that it is almost impossible to use vectorized computation, what would be the most efficient way? Thanks. – towin Nov 09 '15 at 22:03
  • How large is the dataframe? How complicated is the algorithm – can some functions from numpy be used? Are only basic math operations (+-×÷) used? I think it depends on these factors. – jezrael Nov 09 '15 at 22:20
  • Right now I am just working on some simple translation. I have not dug too deep into the logic behind it. It might be the case that the logic can be simplified elsewhere before coming into this step. Thank you for answering. I voted up your answer and it will show up when I reach 15 reputation. – towin Nov 09 '15 at 22:50
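
When the recurrence really cannot be vectorized, the usual fallback (as the comments above suggest) is to compile the row loop itself. A minimal sketch of the question's logic using numba, assuming the numba package is available:

import numpy as np
import pandas as pd
from numba import njit

@njit
def compute_c(a, b):
    # same recurrence as the original loop, on plain numpy arrays
    c = np.empty(len(a))
    c[0] = 1.0 if a[0] == 0 else 0.0
    for i in range(1, len(a)):
        if b[i] == 1:
            c[i] = 1.0              # B == 1 resets the counter
        elif c[i - 1] == 0:
            c[i] = 0.0              # counter stays at zero
        else:
            c[i] = c[i - 1] + 1.0   # otherwise keep counting
    return c

df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [0, 0, 1, 0]})
df['C'] = compute_c(df['A'].values, df['B'].values)
df['lag_C'] = df['C'].shift(1)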