
For a relatively large Pandas DataFrame (a few hundred thousand rows), I'd like to create a series as the result of an apply function. The problem is that the function is not very fast, and I was hoping it could be sped up somehow.

import math
import numpy as np
import pandas as pd

df = pd.DataFrame({
 'value-1': [1, 2, 3, 4, 5],
 'value-2': [0.1, 0.2, 0.3, 0.4, 0.5],
 'value-3': somenumbers...,
 'value-4': more numbers...,
 'choice-index': [1, 1, np.nan, 2, 1]
})

def func(row):
  i = row['choice-index']
  return np.nan if math.isnan(i) else row['value-%d' % i]

df['value'] = df.apply(func, axis=1)

# expected value = [1, 2, np.nan, 0.4, 5]

Any suggestions are welcome.

Update

A very small speedup (~1.1×) can be achieved by pre-caching the selected column names. func would change to:

cached_columns = [None, 'value-1', 'value-2', 'value-3', 'value-4']
def func(row):
  i = row['choice-index']
  # The NaN forces 'choice-index' to a float column, so cast before indexing the list.
  return np.nan if math.isnan(i) else row[cached_columns[int(i)]]

But I was hoping for greater speedups...

orange
  • did you try cython, numba, eval+numexpr suggested in http://pandas.pydata.org/pandas-docs/stable/enhancingperf.html – denfromufa Jul 12 '15 at 13:04
  • No, not for this particular problem. But I think the main problem is the number of calls to the apply function, so `cython`, `numba`, `numexpr`, etc. won't help much to alleviate this. – orange Jul 13 '15 at 01:56

1 Answer


I think I got a good solution (~150× speedup).

The trick is not to use apply, but to do smart selections.

# Initialise the result so rows with a NaN choice-index stay NaN.
df['value'] = np.nan

choice_indices = [1, 2, 3, 4]
for idx in choice_indices:
  # Select every row with this choice index in one vectorized step.
  mask = df['choice-index'] == idx
  result_column = 'value-%d' % idx
  df.loc[mask, 'value'] = df.loc[mask, result_column]
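
For completeness, the same idea can be pushed fully into NumPy. The following is only a sketch under an extra assumption not stated in the question (that all four value-* columns are numeric): it stacks the candidate columns into a 2-D array and picks one entry per row with fancy indexing, masking out rows whose choice-index is NaN.

import numpy as np

# Sketch only: assumes the four 'value-*' columns are numeric.
value_cols = ['value-%d' % i for i in range(1, 5)]
values = df[value_cols].values.astype(float)   # shape (n_rows, 4)

choice = df['choice-index']
has_choice = choice.notnull().values

# 0-based column positions; NaN rows get a dummy position of 0.
col_pos = choice.fillna(1).astype(int).values - 1

picked = values[np.arange(len(df)), col_pos]
picked[~has_choice] = np.nan
df['value'] = picked

This removes the Python-level loop over the choice indices entirely, although with only four candidate columns the masked loop above is already close to optimal.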
orange