
I have a pandas DataFrame with 3,373,612 rows. I would like to run some code on two of the columns to produce two new columns. My code throws an exception, so to diagnose the cause I have cut it back to the simplest code I can think of that takes a row and returns a Series of two values:

import pandas as pd

def split_ids(row):
    # Ignore the incoming row and return a Series holding two values
    return pd.Series([None, None])

analytic_events.apply(split_ids, axis=1)

I am running this in a Jupyter Notebook, but even so I am shocked that after five minutes the code is still running.

I must be misunderstanding something about the pandas `apply` function. Why does such simple code take an inordinate amount of time to run through 3 million rows of a DataFrame?

dumbledad
  • **[When should I ever want to use pandas apply() in my code?](https://stackoverflow.com/questions/54432583/when-should-i-ever-want-to-use-pandas-apply-in-my-code)** This is highly likely to be slow, *whatever* you do. You can try (in this order) a vectorised solution, a list comprehension, or `raw=True`. – jpp Feb 01 '19 at 14:45
  • What would each of those suggestions look like for this simple example? N.B. I pressed "stop" and gave up waiting after 14 minutes! – dumbledad Feb 01 '19 at 14:50
  • Why not [edit your question](https://stackoverflow.com/posts/54481718/edit) with some representative data (a few rows suffice) and a real function? Then we can make suggestions. There's no "one-answer-fits-all" solution. – jpp Feb 01 '19 at 15:04
  • Also, ping me when done and I will happily reopen if you have provided a [mcve]. – jpp Feb 01 '19 at 15:08
  • Thanks @jpp. Sadly the data's personal medical data so I'd need to do lots of work to render it public. I am confused why it matters though, the function I am applying in the question ignores the incoming data. – dumbledad Feb 01 '19 at 15:11
  • Well, the thing with Pandas is what the function actually does *matters very much*. You can just add a column `df['col'] = None` if all you want is a function that does nothing. But that's probably *not* what you want. I'm saying categorically there's no silver bullet. – jpp Feb 01 '19 at 15:14
  • There is a lot of overhead when calling `pd.Series` in an apply function. For creating multiple columns, try one of the solutions [here](https://stackoverflow.com/a/42072756/7489710). You should also explore avoiding apply altogether by using the [str functions](https://stackoverflow.com/a/47097625/7489710). – kayoz Feb 01 '19 at 15:30
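
For illustration, the suggestions in the comments above might look something like the sketch below. It rests on assumptions, because the real data and the real function are not shown: the tiny `analytic_events` frame, its column `a`, the `-` separator, and the output column names `first` and `second` are all hypothetical stand-ins.

```python
import pandas as pd

# Hypothetical stand-in for the real frame; the actual data is not shown.
analytic_events = pd.DataFrame({
    "a": ["x-1", "y-2", "z-3"],
    "b": [10, 20, 30],
})

def split_ids(row):
    # Return a plain tuple rather than building a pd.Series for every row.
    left, right = row["a"].split("-", 1)
    return left, right

# Variant 1: keep apply, but let pandas expand the tuples into columns.
expanded = analytic_events.apply(split_ids, axis=1, result_type="expand")

# Variant 2: a list comprehension over only the column that is needed.
pairs = [value.split("-", 1) for value in analytic_events["a"]]
from_list = pd.DataFrame(
    pairs, index=analytic_events.index, columns=["first", "second"]
)

# Variant 3: the .str accessor, which avoids a hand-written loop.
split = analytic_events["a"].str.split("-", n=1, expand=True)
```

Each variant avoids constructing a `pd.Series` per row, which is the overhead pointed out in the comments. Which one is fastest on 3 million rows is worth timing on the real data rather than assuming.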

0 Answers