
Is there a workaround for issue #11675 in pandas?

I would like to iterate over the following DataFrame and have the applied function called only once for every row:

import pandas
from pandas import Timestamp

test_data = {
    'input': {Timestamp('2015-05-01 12:30:00'): -1.,
              Timestamp('2015-05-01 12:30:01'): 0.,
              Timestamp('2015-05-01 12:30:02'): 1.,
              Timestamp('2015-05-01 12:30:03'): 0.,
              Timestamp('2015-05-01 12:30:04'): -1.
    }
}

def main():
    side_effects = {'state': 'B'}

    def disp(row):
        print('processing row:\n%s' % row)
        if side_effects['state'] == 'A':
            output = 1.
            if row['input'] == 1.:
                side_effects['state'] = 'B'

        else:
            output = -1.
            if row['input'] == -1.:
                side_effects['state'] = 'A'

        return pandas.Series({'input': row['input'], 'state': side_effects['state'], 'output': output})

    test_data_df = pandas.DataFrame(test_data)
    print(test_data_df.apply(disp, axis=1))

main()

At the moment, the function is called twice on the first row in the following environment:

python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
pandas: 0.17.0

The resulting DataFrame therefore looks like this:

                     input  output state
2015-05-01 12:30:00     -1       1     A
2015-05-01 12:30:01      0       1     A
2015-05-01 12:30:02      1       1     B
2015-05-01 12:30:03      0      -1     B
2015-05-01 12:30:04     -1      -1     A

Note that, surprisingly, when I change the input values from float to int in the test_data dict, I get the expected result:

                     input  output state
2015-05-01 12:30:00     -1      -1     A
2015-05-01 12:30:01      0       1     A
2015-05-01 12:30:02      1       1     B
2015-05-01 12:30:03      0      -1     B
2015-05-01 12:30:04     -1      -1     A

I understand that, as the pandas apply() doc mentions, this kind of side effect should be avoided. So, generally speaking, how can we run a state machine that uses DataFrame columns as input, given that apply() is officially not suited for the job?

Christophe
  • It's not really a bug, and it only happens when you use `apply`. Did you read the documentation referenced in the GitHub issue? – Paul H Nov 22 '15 at 19:51
  • Yes, I saw that this is considered a, ahem, feature. A poor decision, I think, but OK. I would still like a way to make sure the function is called only once for each row... maybe iterating is simply the way to go in this case, but maybe there is a better solution? – Christophe Nov 22 '15 at 19:58
  • Can you make a more demonstrative example? You're not really even using the data frame on which you're calling `apply`, other than in the needless print statement. It seems that omitting it would give the desired results. – Paul H Nov 23 '15 at 01:59
  • I can imagine a use case would be: generating a column representing a state, whose value changes according to both the previous state and the other current row values. This is typically a case the Pandas doc warns about when using the apply method ("side-effect")... – Christophe Nov 23 '15 at 17:25
  • `apply` works fine for that. – Paul H Nov 23 '15 at 17:26
  • I will come up with an example, but it requires some more effort on my part. For example: creating a column whose value is the output of a function with some hysteresis. – Christophe Nov 23 '15 at 17:29
  • If you want to "iterate over the DataFrame and have the applied function called only once for every row", you can always do exactly that: iterate with e.g. `for i, row in df.iterrows(): ..` and call the function on each row. – joris Nov 24 '15 at 13:49
  • joris, indeed a for loop seems the simplest way to handle this case. However it would have been nice to allow some side-effects on apply(). The more I think about it, the more I feel like apply() should be renamed: apply_twice_on_first_row_but_only_sometimes() ^_^... If you make it an answer I would approve it. – Christophe Nov 24 '15 at 16:50
  • My use case that led me to discover this was simply "logging" reasons for NaTs from the applied function ... – germannp Jun 26 '19 at 12:29
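
The explicit loop suggested in the comments could be sketched roughly as follows. This is only a minimal illustration, not an accepted answer: the helper name `run_state_machine` and its `initial_state` parameter are made up for this example, and the loop uses `DataFrame.iterrows()`, which yields `(index, row)` pairs, so the body runs exactly once per row regardless of dtype.

```python
import pandas as pd


def run_state_machine(df, initial_state='B'):
    """Hypothetical helper: walk the rows once, carrying the state along."""
    state = initial_state
    records = []
    # iterrows() yields (index, Series) pairs, one per row, in order.
    for _, row in df.iterrows():
        if state == 'A':
            output = 1.0
            if row['input'] == 1.0:
                state = 'B'
        else:
            output = -1.0
            if row['input'] == -1.0:
                state = 'A'
        # Record the state *after* the transition, as in the question.
        records.append({'input': row['input'], 'output': output, 'state': state})
    # Reuse the original index so the result lines up with the input frame.
    return pd.DataFrame(records, index=df.index)


# Same data as in the question, with float inputs.
index = pd.to_datetime([
    '2015-05-01 12:30:00',
    '2015-05-01 12:30:01',
    '2015-05-01 12:30:02',
    '2015-05-01 12:30:03',
    '2015-05-01 12:30:04',
])
df = pd.DataFrame({'input': [-1.0, 0.0, 1.0, 0.0, -1.0]}, index=index)

result = run_state_machine(df)
print(result)
```

With the float inputs from the question, the first row should come out with an output of -1 and state A, matching the result the question describes as expected. An explicit loop is slower than vectorized code on large frames, but it keeps the state handling visible and does not depend on how `apply` invokes the function internally.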

0 Answers