Is there a workaround for pandas issue #11675?
I would like to iterate over the following DataFrame and have the applied function called exactly once for every row:
import pandas
from pandas import Timestamp

test_data = {
    'input': {Timestamp('2015-05-01 12:30:00'): -1.,
              Timestamp('2015-05-01 12:30:01'): 0.,
              Timestamp('2015-05-01 12:30:02'): 1.,
              Timestamp('2015-05-01 12:30:03'): 0.,
              Timestamp('2015-05-01 12:30:04'): -1.
              }
}

def main():
    # Mutable state shared between successive calls of disp()
    side_effects = {'state': 'B'}

    def disp(row):
        print('processing row:\n%s' % row)
        if side_effects['state'] == 'A':
            output = 1.
            if row['input'] == 1.:
                side_effects['state'] = 'B'
        else:
            output = -1.
            if row['input'] == -1.:
                side_effects['state'] = 'A'
        return pandas.Series({'input': row['input'], 'state': side_effects['state'], 'output': output})

    test_data_df = pandas.DataFrame(test_data)
    print(test_data_df.apply(disp, axis=1))

main()
At the moment, the function is called twice on the first row in the following environment:
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
pandas: 0.17.0
As a result, the returned DataFrame looks like this (the output in the first row is 1 instead of -1):
                     input  output state
2015-05-01 12:30:00     -1       1     A
2015-05-01 12:30:01      0       1     A
2015-05-01 12:30:02      1       1     B
2015-05-01 12:30:03      0      -1     B
2015-05-01 12:30:04     -1      -1     A
Surprisingly, when I change the input values in the test_data dict from float to int, I get the expected result:
                     input  output state
2015-05-01 12:30:00     -1      -1     A
2015-05-01 12:30:01      0       1     A
2015-05-01 12:30:02      1       1     B
2015-05-01 12:30:03      0      -1     B
2015-05-01 12:30:04     -1      -1     A
I understand that, as the pandas apply() documentation mentions, this kind of side effect should be avoided. So, generally speaking, how can we run a state machine that takes DataFrame columns as input, given that apply() is officially not suited for the job?
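For reference, the behaviour I am after can be sketched as a plain Python loop over the column, which visits every row exactly once by construction. This is only a minimal sketch under my own assumptions: run_state_machine is a hypothetical helper name (not a pandas API), the transition logic is copied from disp() above, and I have not checked how it scales to large frames.

```python
import pandas as pd


def run_state_machine(inputs, initial_state='B'):
    """Hypothetical helper: walk the inputs in order, once each.

    Returns two lists (outputs, states) with one entry per input value.
    """
    state = initial_state
    outputs, states = [], []
    for value in inputs:
        # Same transitions as disp() in the question
        if state == 'A':
            output = 1.0
            if value == 1.0:
                state = 'B'
        else:
            output = -1.0
            if value == -1.0:
                state = 'A'
        outputs.append(output)
        states.append(state)
    return outputs, states


# Same data as test_data above, built from plain lists
index = pd.to_datetime([
    '2015-05-01 12:30:00',
    '2015-05-01 12:30:01',
    '2015-05-01 12:30:02',
    '2015-05-01 12:30:03',
    '2015-05-01 12:30:04',
])
df = pd.DataFrame({'input': [-1.0, 0.0, 1.0, 0.0, -1.0]}, index=index)

# Iterating over a Series yields its values in index order
outputs, states = run_state_machine(df['input'])
result = df.assign(output=outputs, state=states)
print(result)
```

With the float inputs this gives -1 as the output of the first row, which is the result I expected from apply(). The state lives in a local variable of the loop rather than in a closure, so it does not matter how many times pandas would call a function internally.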