
In the example below, the first apply works. The second throws "TypeError: ("Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'", u'occurred at index 0')"

import pandas as pd

df = pd.DataFrame({'lag':[ 3, 5, 3, 4, 2, 3, 2, 3, 4, 3, 2, 2, 2, 3],
                   'A':[10,20,30,40,20,30,40,10,20,30,15,60,20,15],
                   'B':[11,21,31,41,21,31,41,11,21,31,15,61,21,25]})
df['C'] = df.apply(lambda x: df['A'].shift(x['lag'])[x.name], axis=1)
print(df)
df['D'] = df.apply(lambda x: df['B'].shift(x['lag'])[x.name], axis=1)
print(df)

Please tell me why this happens and how to fix it. Thanks,

(Note: I do not have enough "points" to post a comment in Variable shift in Pandas)

Bill

1 Answer


There is actually a tricky thing going on here. I'll try to be succinct.

When you use apply with axis=1, pandas iterates row by row, handing each row to your function as a pd.Series. A Series has a single dtype, so a row drawn from mixed-type columns is upcast to a common type. Your first apply adds column C, which contains NaN values. From then on, every row contains a NaN, so each row Series is upcast to float64. That makes x['lag'] a float, and Series.shift cannot safely cast a float period to the integer it needs, hence the TypeError.
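A minimal sketch of the upcast, using a hypothetical two-row frame rather than the question's data:

```python
import numpy as np
import pandas as pd

# Once any column holds NaN, each row pulled out as a Series
# is upcast to a common float dtype.
df = pd.DataFrame({'lag': [3, 5], 'A': [10, 20]})
df['C'] = np.nan          # stands in for the NaN-laden column C from the first apply

row = df.loc[0]           # extracting a row yields a pd.Series
print(row.dtype)          # float64 -- the NaN forced the whole row to float
print(type(row['lag']))   # <class 'numpy.float64'>, no longer an integer
```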


Workaround #1
Ensure the lag value is an int

df['D'] = df.apply(lambda x: df['B'].shift(int(x['lag']))[x.name], axis=1)

Workaround #2
Do both assignments at the same time, so neither apply runs on a frame that already contains a NaN column

df = df.assign(
    C=df.apply(lambda x: df['A'].shift(x['lag'])[x.name], axis=1),
    D=df.apply(lambda x: df['B'].shift(int(x['lag']))[x.name], axis=1)
)

Better solution
However, I'd use numpy to help with this.

Those lagged positions are just the current positions minus the lag values:

l = (np.arange(len(df)) - df.lag.values)

then

df['C'] = np.where(l >= 0, df.A.values[l], np.nan)
df['D'] = np.where(l >= 0, df.B.values[l], np.nan)
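Putting the vectorized pieces together with the question's data, a self-contained sketch, cross-checked against the int-cast apply workaround:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'lag': [3, 5, 3, 4, 2, 3, 2, 3, 4, 3, 2, 2, 2, 3],
                   'A': [10, 20, 30, 40, 20, 30, 40, 10, 20, 30, 15, 60, 20, 15],
                   'B': [11, 21, 31, 41, 21, 31, 41, 11, 21, 31, 15, 61, 21, 25]})

# For row i, the lagged value lives at position i - lag[i].
l = np.arange(len(df)) - df.lag.values

# Negative positions reach before the start of the frame -> NaN.
df['C'] = np.where(l >= 0, df.A.values[l], np.nan)
df['D'] = np.where(l >= 0, df.B.values[l], np.nan)

# Cross-check against the row-wise apply workaround.
check = df.apply(lambda x: df['A'].shift(int(x['lag']))[x.name], axis=1)
print(df['C'].equals(check))   # True
```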
piRSquared
  • Thanks! Works like a champ! – Bill Mar 13 '17 at 21:14
  • @Bill I've got better in an update coming in a few minutes – piRSquared Mar 13 '17 at 21:15
  • That's crafty. I assume this works a lot faster since it operates column-wise instead of row-wise? – Bill Mar 13 '17 at 21:28
  • @Bill I added that answer to the question you referenced. Go there and look at the time differences. The speed up comes from not iterating over every row and instead using a vectorized approach of slicing. – piRSquared Mar 13 '17 at 21:30