
In the example below, the first apply works. The second throws "TypeError: ("Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'", u'occurred at index 0')"

import pandas as pd

df = pd.DataFrame({'lag':[ 3, 5, 3, 4, 2, 3, 2, 3, 4, 3, 2, 2, 2, 3],
                   'A':[10,20,30,40,20,30,40,10,20,30,15,60,20,15],
                   'B':[11,21,31,41,21,31,41,11,21,31,15,61,21,25]})
df['C'] = df.apply(lambda x: df['A'].shift(x['lag'])[x.name], axis=1)
print(df)
df['D'] = df.apply(lambda x: df['B'].shift(x['lag'])[x.name], axis=1)
print(df)

Please tell me why this happens and how to fix it. Thanks,

(Note: I do not have enough "points" to post a comment in Variable shift in Pandas)

Bill

1 Answer


There is actually a tricky thing going on here. I'll try to be succinct.

When you use apply with axis=1, pandas iterates row by row, handing each row to your function as a pd.Series. A Series has a single dtype, so a row drawn from mixed-type columns is upcast to a common type. Your first apply adds column C, which contains NaN values. From then on, every row contains a NaN, so each row Series is upcast to float64. That makes x['lag'] a float, and Series.shift cannot safely cast a float period to the integer it needs, hence the TypeError.
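A minimal sketch of the upcast, using a hypothetical two-row frame rather than the question's data:

```python
import numpy as np
import pandas as pd

# Once any column holds NaN, each row pulled out as a Series
# is upcast to a common float dtype.
df = pd.DataFrame({'lag': [3, 5], 'A': [10, 20]})
df['C'] = np.nan          # stands in for the NaN-laden column C from the first apply

row = df.loc[0]           # extracting a row yields a pd.Series
print(row.dtype)          # float64 -- the NaN forced the whole row to float
print(type(row['lag']))   # <class 'numpy.float64'>, no longer an integer
```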


Workaround #1
Ensure the lag value is an int

df['D'] = df.apply(lambda x: df['B'].shift(int(x['lag']))[x.name], axis=1)

Workaround #2
Do both assignments at the same time, so neither apply runs on a frame that already contains a NaN column

df = df.assign(
    C=df.apply(lambda x: df['A'].shift(x['lag'])[x.name], axis=1),
    D=df.apply(lambda x: df['B'].shift(int(x['lag']))[x.name], axis=1)
)

Better solution
However, I'd use numpy to help with this.

Those lagged positions are just the current positions minus the lag values:

l = (np.arange(len(df)) - df.lag.values)

then

df['C'] = np.where(l >= 0, df.A.values[l], np.nan)
df['D'] = np.where(l >= 0, df.B.values[l], np.nan)
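Putting the vectorized pieces together with the question's data, a self-contained sketch, cross-checked against the int-cast apply workaround:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'lag': [3, 5, 3, 4, 2, 3, 2, 3, 4, 3, 2, 2, 2, 3],
                   'A': [10, 20, 30, 40, 20, 30, 40, 10, 20, 30, 15, 60, 20, 15],
                   'B': [11, 21, 31, 41, 21, 31, 41, 11, 21, 31, 15, 61, 21, 25]})

# For row i, the lagged value lives at position i - lag[i].
l = np.arange(len(df)) - df.lag.values

# Negative positions reach before the start of the frame -> NaN.
df['C'] = np.where(l >= 0, df.A.values[l], np.nan)
df['D'] = np.where(l >= 0, df.B.values[l], np.nan)

# Cross-check against the row-wise apply workaround.
check = df.apply(lambda x: df['A'].shift(int(x['lag']))[x.name], axis=1)
print(df['C'].equals(check))   # True
```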
piRSquared
  • Thanks! Works like a champ! – Bill Mar 13 '17 at 21:14
  • @Bill I've got better in an update coming in a few minutes – piRSquared Mar 13 '17 at 21:15
  • That's crafty. I assume this works a lot faster since it operates column-wise instead of row-wise? – Bill Mar 13 '17 at 21:28
  • @Bill I added that answer to the question you referenced. Go there and look at the time differences. The speed up comes from not iterating over every row and instead using a vectorized approach of slicing. – piRSquared Mar 13 '17 at 21:30