
I'm grouping half-hourly data by day and applying a function to calculate daily statistics on my dataset, but it's slow. Is there a way to improve the performance of the following functions? I've read about vectorisation, but I'm unsure how to implement it.

I've used apply and transform to get the outputs I would like, but it takes about 2-3 seconds for a whole year's worth of data. I hope to do it faster, as I have a lot of data. Can anyone point me in the right direction?

import pandas as pd
import numpy as np
import timeit

# dummy data
date_range = pd.date_range('2017-01-01 00:00', '2018-01-01 00:00', freq='30Min')
df = pd.DataFrame(np.random.randint(2, 20, (date_range.shape[0], 2)), index=date_range, columns=['Electricity', 'Natural Gas'])

print(df.head())
print(df.shape)

t1 = timeit.default_timer()
# count, per day and per column, the half-hours whose value exceeds
# q05 + 0.25 * (q95 - q05), restricted to values above q05 + 10% of the
# mean; dividing by 2 converts half-hour counts into hours
onhour = df.groupby(pd.Grouper(freq='D')).apply(lambda x: np.count_nonzero(
    x[x > x.quantile(0.05) + x.mean() * .1] >
    x.quantile(0.05) + 0.25 * (x.quantile(0.95) - x.quantile(0.05)),
    axis=0) / 2)

onhour = pd.DataFrame(
    onhour.values.tolist(),
    index=onhour.index,
    columns=df.columns)

print(f"start_time in {timeit.default_timer() - t1}")
print(onhour.head())

t1 = timeit.default_timer()
# same computation via transform: each half-hour in a day is filled with
# the day's count, so resampling back to daily frequency recovers one
# value per day
onhour = df.groupby(pd.Grouper(freq='D')).transform(lambda x: np.count_nonzero(
    x[x > x.quantile(0.05) + x.mean() * .1] >
    x.quantile(0.05) + 0.25 * (x.quantile(0.95) - x.quantile(0.05)),
    axis=0) / 2).resample('D').mean()

print(f"start_time in {timeit.default_timer() - t1}")
print(onhour.head())
Chris

1 Answer


You're already benefiting from pandas' vectorized internals, so you won't gain a lot of time, but a few tricks can get you down to around 1.5 seconds.

1) Use agg

Using agg instead of transform or apply will give a better result because you run the same computation on each column (Electricity & Natural Gas), and agg hands the function one column at a time.
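
For illustration only (this is standard pandas behaviour, not part of the final solution), the difference in what the function receives looks like this:

# agg passes the function one column of one day at a time (a Series);
# apply passes it the whole day's DataFrame
per_column = df.groupby(pd.Grouper(freq='D')).agg(lambda col: col.max())
per_frame = df.groupby(pd.Grouper(freq='D')).apply(lambda day: day.max())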

2) Save your quantile computations.

You are computing the 5% quantile three times. I used a named Python function instead of a lambda; you could still use a lambda if you add a memoized quantile function (a sketch of such a helper follows after the timing below). It might actually speed things up further, but I'm not sure.

def count_something(col):
    # compute both quantiles once per column instead of three times
    qt = col.quantile([0.05, 0.95])
    return np.count_nonzero(
        col[col > qt.loc[0.05] + col.mean() * .1] >
        qt.loc[0.05] + 0.25 * (qt.loc[0.95] - qt.loc[0.05]),
        axis=0) / 2

t1 = timeit.default_timer()

onhour = df.groupby(pd.Grouper(freq='D')).agg(count_something)

print(f"start_time in {timeit.default_timer() - t1}")
print(onhour.head())
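
For the record, here is a minimal sketch of what a memoized quantile helper could look like. The names quantiles and _cached_quantiles are my own, and the hashing overhead may well eat up the savings:

from functools import lru_cache

@lru_cache(maxsize=None)
def _cached_quantiles(key, dtype):
    # rebuild the array from its raw bytes and compute both quantiles once
    arr = np.frombuffer(key, dtype=dtype)
    return np.quantile(arr, [0.05, 0.95])

def quantiles(col):
    # a pandas Series isn't hashable, so cache on its raw bytes and dtype
    return _cached_quantiles(col.to_numpy().tobytes(), str(col.dtype))

onhour = df.groupby(pd.Grouper(freq='D')).agg(
    lambda x: np.count_nonzero(
        x[x > quantiles(x)[0] + x.mean() * .1] >
        quantiles(x)[0] + 0.25 * (quantiles(x)[1] - quantiles(x)[0])) / 2)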

If you really want to speed up the computation further and you have a way to parallelize or distribute it, you could look at Python's dask, but I don't know how much it would improve your problem. A rough sketch of the idea follows below.
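
As an untested sketch of that idea (assuming dask is installed; the helper name daily_counts is my own), you could split the year into monthly chunks and aggregate them in parallel with dask.delayed:

import dask

def daily_counts(part):
    # aggregate one chunk with the same per-day logic as above
    return part.groupby(pd.Grouper(freq='D')).agg(count_something)

# monthly chunks align with day boundaries, so no day is split across chunks;
# for a dataset this small the scheduling overhead may outweigh the gain
chunks = [chunk for _, chunk in df.groupby(pd.Grouper(freq='M'))]
parts = [dask.delayed(daily_counts)(chunk) for chunk in chunks]
onhour = pd.concat(dask.compute(*parts))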

godot
  • Ah yes, perfect. This shaves off a good bit of time in total. There may be scope for dask in the future! Using a lambda instead of a function seems to be marginally faster. – Chris Apr 19 '19 at 07:26
  • I also tried to see if calculating the median (50% quantile) was faster than the mean, as it's quite a good approximation with unskewed data. To my surprise, it does not seem to be quicker, but I didn't do a robust test... – godot Apr 19 '19 at 07:32