
I have a big dataframe which has two million rows. There are 60000 unique (store_id, product_id) pairs.

I need to select by each (store_id, product_id) pair, do some calculations such as resampling to hourly frequency, sum, and average, and finally concat everything into a new dataframe.

The problem is that it is very, very slow, and it becomes slower while running.

The main code is:

import pandas as pd
from tqdm import tqdm

def process_df(df, func, *args, **kwargs):
    '''Apply func to every (store_id, product_id) subset and concat the results.'''
    product_ids = df.product_id.unique()
    store_ids = df.store_id.unique()

    all_df = list()

    with tqdm(total=product_ids.shape[0] * store_ids.shape[0]) as t:

        for store_id in store_ids:
            sdf = df.loc[df['store_id'] == store_id]
            for product_id in product_ids:
                new_df = sdf.loc[sdf['product_id'] == product_id]

                # skip pairs with too little history
                if new_df.shape[0] < 14:
                    continue

                new_df = func(new_df, *args, **kwargs)
                new_df.loc[:, 'store_id'] = store_id
                new_df.loc[:, 'product_id'] = product_id

                all_df.append(new_df)

                t.update()

        all_df = pd.concat(all_df)

    return all_df


def process_order_items(df, store_id=None, product_id=None, freq='D'):
    if store_id and "store_id" in df.columns:
        df = df.loc[df['store_id']==store_id]

    if product_id and "product_id" in df.columns:
        df = df.loc[df['product_id']==product_id]

    # convert to datetime

    df.loc[:, "datetime_create"] = pd.to_datetime(df.time_create, unit='ms').dt.tz_localize('UTC').dt.tz_convert('Asia/Shanghai').dt.tz_localize(None)
    df = df[["price", "count", "fee_total", "fee_real", "price_real", "price_guide", "price_change_category", "datetime_create"]]

    df.loc[:, "has_discount"] = (df.price_change_category > 0).astype(int)
    df.loc[:, "clearance"] = df.price_change_category.apply(lambda x:x in(10, 20, 23)).astype(int) 

    if not freq:
        df.loc[:, "date_create"] = df["datetime_create"]
    else:
        assert freq in ('D', 'H')
        df.index = df.loc[:, "datetime_create"]

        discount_order_count = df['has_discount'].resample(freq).sum()

        clearance_order_count = df['clearance'].resample(freq).sum()

        discount_sale_count = df.loc[df.has_discount > 0, 'count'].resample(freq).sum()
        clearance_sale_count = df.loc[df.clearance > 0, 'count'].resample(freq).sum()

        no_discount_price = df.loc[df.has_discount == 0, 'price'].resample(freq).sum()
        no_clearance_price = df.loc[df.clearance == 0, 'price'].resample(freq).sum()

        order_count = df['count'].resample(freq).count()
        day_count = df['count'].resample(freq).sum()

        price_guide = df['price_guide'].resample(freq).max()
        price_avg = (df['price'] * df['count']).resample(freq).sum() / day_count

        df = pd.DataFrame({
            "price": price_avg,
            "price_guide": price_guide,
            "sale_count": day_count,
            "order_count": order_count,
            "discount_order_count": discount_order_count,
            "clearance_order_count": clearance_order_count,
            "discount_sale_count": discount_sale_count,
            "clearance_sale_count": clearance_sale_count,
        })

        df = df.drop(df[df.order_count == 0].index)

    return df

I think the problem is that there are too many redundant selections.

Maybe I could use `groupby(['store_id','product_id']).agg` to avoid the redundant selections, but I have no idea how to use `process_order_items` with it and how to merge the results together.

Mithril
  • Since your code _runs error-free_, this question is off topic on SO. You may want to post it at [CodeReview](https://codereview.stackexchange.com/). – DYZ Aug 22 '18 at 04:51

1 Answer


I think you can change:

df.loc[:,"clearance"] = df.price_change_category.apply(lambda x:x in(10, 20, 23)).astype(int) 

to `Series.isin`:

df["clearance"] = df.price_change_category.isin([10, 20, 23]).astype(int) 
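As a quick sanity check, here is a toy example with made-up values showing that the vectorized `isin` gives the same result as the row-wise `apply`:

import pandas as pd

s = pd.Series([10, 15, 20, 23, 30])

via_apply = s.apply(lambda x: x in (10, 20, 23)).astype(int)
via_isin = s.isin([10, 20, 23]).astype(int)

print(via_apply.equals(via_isin))  # True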

Also, a solution using `Resampler.aggregate`:

d = {'has_discount':'sum',
     'clearance':'sum',
     'count': ['count', 'sum'],
     'price_guide':'max'}

df1 = df.resample(freq).agg(d)
df1.columns = df1.columns.map('_'.join)
d1 = {'has_discount_sum':'discount_order_count',
      'clearance_sum':'clearance_order_count',
      'count_count':'order_count',
      'count_sum':'day_count',
      'price_guide_max':'price_guide'}
df1 = df1.rename(columns=d1)
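The weighted average price from the original code could be folded into the same single pass as well; a sketch, untested against real data, where `price_x_count` is a helper column introduced here:

# helper column so the weighted average reduces to two plain sums
df['price_x_count'] = df['price'] * df['count']

d = {'has_discount': 'sum',
     'clearance': 'sum',
     'count': ['count', 'sum'],
     'price_guide': 'max',
     'price_x_count': 'sum'}

df1 = df.resample(freq).agg(d)
df1.columns = df1.columns.map('_'.join)

# price_avg = sum(price * count) / sum(count), as in the original code
df1['price'] = df1['price_x_count_sum'] / df1['count_sum']
df1 = df1.drop(columns='price_x_count_sum')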

Another idea is not to convert the boolean masks to integers, but to use the boolean columns directly for filtering:

df["has_discount"] = df.price_change_category > 0
df["clearance"] = df.price_change_category.isin([10, 20, 23])


discount_sale_count = df.loc[df.has_discount, 'count'].resample(freq).sum()
clearance_sale_count = df.loc[df.clearance, 'count'].resample(freq).sum()

# for filtering == 0, invert the boolean mask columns with ~
no_discount_price = df.loc[~df.has_discount, 'price'].resample(freq).sum()
no_clearance_price = df.loc[~df.clearance, 'price'].resample(freq).sum()
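Note that the `.sum()` aggregations still work on the raw booleans, since `True` counts as 1; a toy check with made-up values:

import pandas as pd

s = pd.Series([True, False, True])
print(s.sum())              # 2
print(s.astype(int).sum())  # 2, same result without the conversion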

The first function should be simplified with `GroupBy.apply` instead of the loops; then the manual `concat` is not necessary:

def f(x):
    print(x)

df = df.groupby(['product_id','store_id']).apply(f)
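Concretely, a minimal sketch of how `process_order_items` could be plugged in, assuming the function stays as posted; the `len(g) < 14` guard replicates the original minimum-rows check:

def per_pair(g):
    # skip pairs with fewer than 14 rows, as in the original loop
    if len(g) < 14:
        return None  # groups returning None are dropped from the result
    return process_order_items(g, freq='H')

res = df.groupby(['store_id', 'product_id'], group_keys=True).apply(per_pair)
# bring the group keys back as columns, like the manual store_id/product_id assignment
res = res.reset_index(level=['store_id', 'product_id'])

This way each (store_id, product_id) subset is sliced once by the groupby machinery instead of re-scanning the full two-million-row frame for each of the 60,000 pairs.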
jezrael
  • The `process_order_items` is passed as `func` in `process_df`. I think it is the selection in `process_df` that slows the program down, but I don't know how to do the groupby with `process_order_items`. Because each product in each store may have a different price, I need to select them first. – Mithril Aug 22 '18 at 05:26
  • Do you mean the only inefficiency in my code is `process_order_items`? – Mithril Aug 22 '18 at 05:27
  • @Mithril - Yes, exactly. – jezrael Aug 22 '18 at 05:29
  • @Mithril - Is it possible to add some sample data, e.g. 10 rows? Because the code was written without testing ;) – jezrael Aug 22 '18 at 05:48
  • Sorry, I can't provide sample data because of security concerns. – Mithril Aug 22 '18 at 06:33
  • Thank you very much. I found that tqdm slows pandas down too; using `tqdm.pandas(); df.progress_apply(func)` combined with reducing the number of resample calls gave a 10x speedup over the original. – Mithril Aug 22 '18 at 06:33
  • @Mithril - Super, glad can help! – jezrael Aug 22 '18 at 06:33
  • You can cite this in the answer: https://stackoverflow.com/questions/18603270/progress-indicator-during-pandas-operations-python – Mithril Aug 22 '18 at 06:34