
It seems that apply should accelerate operations on a DataFrame in most cases, but when I use apply I don't see any speedup. Here is my example; I have a DataFrame with two columns:

>>> df
index  col1  col2
1      10    20
2      20    30
3      30    40

What I want to do is calculate a value for each row of the DataFrame by applying a function R(x) to col1 and dividing the result by the value in col2. For example, the result for the first row should be R(10)/20.

This is my function, which will be called in apply:

def _f(row):
    # apply R to col1, then divide by the value in col2
    return R(row['col1']) / row['col2']

Then I call _f in apply: df.apply(_f, axis=1)

But I find that in this case apply is much slower than a for loop, like:

for i in df.index:
    new_df.loc[i] = R(df.loc[i, 'col1']) / df.loc[i, 'col2']

Can anyone explain the reason?

  • Could there be something funny about the first row of data? apply calls the function twice on the first row to determine the shape of the returned data, so it can intelligently figure out how results will be combined. This is by design and in the docs; see the notes here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html – AZhao Aug 14 '16 at 02:00

1 Answer


It is my understanding that .apply is not generally faster than iteration over the axis. I believe that under the hood it is merely a loop over the axis, with the added overhead of a function call on every iteration, which in this case you incur for every row.
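
A quick way to convince yourself of this is to time both approaches on the same data. A minimal sketch, where R is a hypothetical stand-in (here just squaring), since the question does not define it:

    import numpy as np
    import pandas as pd
    from timeit import timeit

    def R(x):
        # hypothetical stand-in for the question's R
        return x ** 2

    df = pd.DataFrame({'col1': np.random.randint(1, 100, 1000),
                       'col2': np.random.randint(1, 100, 1000)})

    def _f(row):
        return R(row['col1']) / row['col2']

    def with_apply():
        return df.apply(_f, axis=1)

    def with_loop():
        # build the result row by row, as in the question
        out = pd.Series(index=df.index, dtype=float)
        for i in df.index:
            out.loc[i] = R(df.loc[i, 'col1']) / df.loc[i, 'col2']
        return out

    print('apply:', timeit(with_apply, number=10))
    print('loop: ', timeit(with_loop, number=10))

Both are Python-level loops, and apply additionally constructs a Series per row, which is why it can even lose to the plain loop.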

If we look at the source code, we can see that it essentially iterates over the indicated axis and applies the function, building the individual results up as Series in a dictionary, and finally calling the DataFrame constructor on that dictionary to return a new DataFrame:

    if axis == 0:
        series_gen = (self._ixs(i, axis=1)
                      for i in range(len(self.columns)))
        res_index = self.columns
        res_columns = self.index
    elif axis == 1:
        res_index = self.index
        res_columns = self.columns
        values = self.values
        series_gen = (Series.from_array(arr, index=res_columns, name=name,
                                        dtype=dtype)
                      for i, (arr, name) in enumerate(zip(values,
                                                          res_index)))
    else:  # pragma : no cover
        raise AssertionError('Axis must be 0 or 1, got %s' % str(axis))

    i = None
    keys = []
    results = {}
    if ignore_failures:
        successes = []
        for i, v in enumerate(series_gen):
            try:
                results[i] = func(v)
                keys.append(v.name)
                successes.append(i)
            except Exception:
                pass
        # so will work with MultiIndex
        if len(successes) < len(res_index):
            res_index = res_index.take(successes)
    else:
        try:
            for i, v in enumerate(series_gen):
                results[i] = func(v)
                keys.append(v.name)
        except Exception as e:
            if hasattr(e, 'args'):
                # make sure i is defined
                if i is not None:
                    k = res_index[i]
                    e.args = e.args + ('occurred at index %s' %
                                       pprint_thing(k), )
            raise

    if len(results) > 0 and is_sequence(results[0]):
        if not isinstance(results[0], Series):
            index = res_columns
        else:
            index = None

        result = self._constructor(data=results, index=index)
        result.columns = res_index

        if axis == 1:
            result = result.T
        result = result._convert(datetime=True, timedelta=True, copy=False)

    else:

        result = Series(results)
        result.index = res_index

    return result

Specifically:

for i, v in enumerate(series_gen):
    results[i] = func(v)
    keys.append(v.name)

Here, series_gen was constructed based on the requested axis.
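
Since apply therefore calls your function once per row in ordinary Python, the biggest win in a case like yours usually comes from vectorizing R itself, when that is possible. A minimal sketch, assuming R reduces to NumPy-friendly arithmetic (again using R(x) = x ** 2 as a stand-in):

    # operate on whole columns at once; NumPy broadcasts over the arrays,
    # so there is no per-row Python function call at all
    def R(x):
        return x ** 2

    result = R(df['col1']) / df['col2']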

To get more performance out of a function, you can follow the advice given in the pandas documentation on enhancing performance.

Essentially, your options are:

  1. Write a C extension
  2. Use numba (a JIT compiler); a minimal sketch follows this list
  3. Use pandas.eval to squeeze performance out of large DataFrames
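
For option 2, a sketch using numba.vectorize; the body inlines a hypothetical stand-in for R (R(x) = x ** 2), since numba needs the whole computation available to the compiler:

    import numba
    import pandas as pd

    @numba.vectorize(['float64(int64, int64)'])
    def r_div(c1, c2):
        # hypothetical R inlined here: R(x) = x ** 2
        return (c1 ** 2) / c2

    df = pd.DataFrame({'col1': [10, 20, 30], 'col2': [20, 30, 40]})
    result = pd.Series(r_div(df['col1'].values, df['col2'].values),
                       index=df.index)

pandas.eval (option 3) only helps when R reduces to an arithmetic expression, e.g. pd.eval('df.col1 ** 2 / df.col2'), and mostly pays off on large DataFrames.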