
The apply function seems to run very slowly on a large DataFrame (about 1 to 3 million rows).

I have checked related questions here, such as Speed up Pandas apply function and Counting within pandas apply() function, and it seems the best way to speed it up is to not use the apply function at all :)

In my case, I have two kinds of tasks to do with the apply function.

First: apply with a lookup-table query

def f(p_id, p_dict):
    # look up the row whose ID matches p_id and return its value
    return p_dict[p_dict['ID'] == p_id]['value']

p_dict = pd.DataFrame(...)  # another DataFrame that works like a lookup table
df = df.apply(f, args=(p_dict,))

Second: apply with groupby

from functools import partial

def f(week_id, min_week_num, p_dict):
    # mean of the third column over the window (week_id - min_week_num, week_id)
    return p_dict[(week_id - min_week_num < p_dict['WEEK']) & (p_dict['WEEK'] < week_id)].iloc[:, 2].mean()

f_partial = partial(f, min_week_num=min_week_num, p_dict=p_dict)
df = list(map(f_partial, df['WEEK']))

I guess the first case could be done with a DataFrame join, though I am not sure about the resource cost of such a join on a large dataset.
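Something like the following is what I have in mind (a rough sketch with made-up data; a left merge so IDs missing from the lookup table come back as NA):

import pandas as pd

# hypothetical data just for illustration
p_dict = pd.DataFrame({'ID': [1, 2, 3], 'value': ['US', 'CN', 'DE']})
df = pd.DataFrame({'p_id': [2, 3, 9]})  # 9 has no match in p_dict

# a left join keeps every row of df; unmatched IDs get NaN in 'value'
df = df.merge(p_dict, how='left', left_on='p_id', right_on='ID')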

My questions are:

  1. Is there any way to substitute for apply in the two cases above?
  2. Why is apply so slow? For the dict lookup case, I think it should be O(N); it shouldn't cost that much even if N is 1 million.
    What is your first part of the question trying to do? From the definition of `f`, I have some suspicion it's not doing what you want it to do. Perhaps you should add small examples of the dataframes and the expected results. – Ami Tavory Aug 01 '16 at 18:59
  • Thanks Ami. For the first question: for example, p_dict is a lookup table of country id to country name. I want to query the country name with a country id as input; if the country id does not exist, return NA :) – linpingta Aug 02 '16 at 01:49

1 Answer


Concerning your first question, I can't say exactly why this particular instance is slow. But in general, apply does not take advantage of vectorization. Also, apply returns a new Series or DataFrame object, so with a very large DataFrame you have considerable object-creation overhead (I can't guarantee this is the case 100% of the time, since Pandas has loads of internal optimizations).
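As a rough illustration, compare a per-row apply (one Python-level function call per element) with the equivalent vectorized expression (a few C-level operations over the whole column):

import numpy as np
import pandas as pd

# hypothetical data just for illustration
df_demo = pd.DataFrame({'x': np.random.rand(10**6)})

# row-wise apply: one Python function call per element
slow = df_demo['x'].apply(lambda v: v * 2 + 1)

# vectorized: the whole column at once
fast = df_demo['x'] * 2 + 1

assert np.allclose(slow, fast)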

For your first method, I assume you are trying to fill a 'value' column in df using the p_dict as a lookup table. It is about 1000x faster to use pd.merge:

import string

import numpy as np
import pandas as pd

##
# Part 1 - filling a column by a lookup table
##
def f1(col, p_dict):
    # per-element lookup, mimicking the row-by-row approach from the question
    return [p_dict[p_dict['ID'] == s]['value'].values[0] for s in col]

# Testing
n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'ID': [s for s in string.ascii_uppercase], 'value': np.random.randint(0,n_size, 26)})
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})

# Apply the f1 method as posted
%timeit -n1 -r5 temp = df.apply(f1, args=(p_dict,))
>>> 1 loops, best of 5: 832 ms per loop

# Using merge
np.random.seed(997)
df = pd.DataFrame({'p_id': [string.ascii_uppercase[i] for i in np.random.randint(0,26, n_size)]})
%timeit -n1 -r5 temp = pd.merge(df, p_dict, how='inner', left_on='p_id', right_on='ID', copy=False)

>>> 1000 loops, best of 5: 826 µs per loop
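As an aside, for a pure single-column lookup you could also try Series.map with a plain dict; missing keys come back as NaN, like a left join:

# build a dict from the lookup table, then map it onto the key column
lookup = dict(zip(p_dict['ID'], p_dict['value']))
df['value'] = df['p_id'].map(lookup)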

Concerning the second task, we can quickly add a new column to p_dict that holds the mean over a time window starting at min_week_num and ending at the week for that row of p_dict. This requires that p_dict is sorted in ascending order along the WEEK column. Then you can use pd.merge again.

I am assuming that min_week_num is 0 in the following example. But you could easily modify rolling_growing_mean to take a different value. The rolling_growing_mean method will run in O(n) since it conducts a fixed number of operations per iteration.

n_size = 1000
np.random.seed(997)
p_dict = pd.DataFrame({'WEEK': range(52), 'value': np.random.randint(0, 1000, 52)})
df = pd.DataFrame({'WEEK': np.random.randint(0, 52, n_size)})

def rolling_growing_mean(values):
    out = np.empty(len(values))
    out[0] = values[0]
    # Time window for taking mean grows each step
    for i, v in enumerate(values[1:]):
        out[i+1] = np.true_divide(out[i]*(i+1) + v, i+2)
    return out

p_dict['Means'] = rolling_growing_mean(p_dict['value'])

df_merged = pd.merge(df, p_dict, how='inner', on='WEEK')
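As a quick sanity check, the hand-rolled function is just an expanding (cumulative) mean, so it should agree with pandas' built-in version:

# compare against pandas' built-in expanding mean
expected = p_dict['value'].expanding().mean()
assert np.allclose(p_dict['Means'], expected)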
  • Thanks andrew, really nice answer for the first question :) But I think you may have misunderstood my second question: week_id is an input parameter. For example, with week_id = 5 and min_week_num = 2 I want the mean over weeks 3 to 4, while with week_id = 6 I query weeks 4 to 5, so I guess it's not equivalent to move it out of the function. – linpingta Aug 02 '16 at 01:50
  • @linpingta yes of course. I was thinking of math, not coding! I deleted that part of the answer. – andrew Aug 02 '16 at 02:02
  • So is there any way to solve that? I know maybe I should post it as another question, but any advice would be appreciated :) – linpingta Aug 02 '16 at 02:10
  • It looks like you want `min_week_num` to stay fixed while `week_id` changes. Is that correct? – andrew Aug 02 '16 at 02:30
  • Yes andrew, week_id changes with different rows of the dataframe, while min_week_num can be treated as a constant value here. – linpingta Aug 02 '16 at 02:47
  • @linpingta did that edit help with your second question? If not, maybe you should post a new question that includes a minimal, complete and verifiable example. – andrew Aug 03 '16 at 16:22