
I have a performance problem. The following code took 3 hours to loop through 5,000 items out of 50,000.

I have a dataframe df and a list of dictionary keys, key_list, to loop through. Each key corresponds to a single index of the dataframe. At each index, I want to take the mean of the columns mean_cols over a few rows before and a few rows after the index, and then create a new dictionary with the before and after columns.

mean_cols = ['A', 'B', 'C']
rows_list = []
key_list = list(some_dict.keys()) # around 50k items

for key in key_list:

    row_dict = {}
    means_after = df[mean_cols].iloc[key:key+5].mean()
    means_before = df[mean_cols].iloc[key-5:key].mean()

    for col in mean_cols:
        row_dict[col + '_after'] = round(means_after[col], 2)
        row_dict[col + '_before'] = round(means_before[col], 2)

    rows_list.append(row_dict)

I am pretty sure it's these two lines:

means_after = df[mean_cols].iloc[key:key+5].mean()
means_before = df[mean_cols].iloc[key-5:key].mean()

however I can't think of a faster way to do it. Anyone have any ideas?

Joram

2 Answers


Have a look at pandas' rolling functionality. The docs list various windowing strategies and options that should cover your problem. Call the rolling method of your DataFrame with options suited to your desired windows, then chain a mean call to get a rolling-mean DataFrame.

EDIT: Here is a simplified example to get you started. You'll want to review the output and make sure that's what you are going for, and this also has NaN values until the window size is full.

If I increase the columns to each have 50k items this takes 7 milliseconds on my laptop.

import pandas as pd

mean_cols = ["A", "B"]

df = pd.DataFrame({
    "A": range(0, 50000),
    "B": range(0, 50000)
})

rm_before = df[mean_cols].rolling(5).mean()  # at row i: mean of rows i-4..i (includes row i)
rm_after = rm_before.shift(-4)               # at row i: mean of rows i..i+4
# note: the question's "before" window (iloc[key-5:key]) excludes row key itself;
# chain .shift(1) onto rm_before to reproduce that exactly

for col in mean_cols:
    df[col+"_before"] = rm_before[col]
    df[col+"_after"] = rm_after[col]

print(df)
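If the keys are scattered integer row positions (as in the question's key_list), the rolling means can still be computed once over the whole frame and then looked up per key. A sketch, assuming positional integer keys (the key_list values here are hypothetical stand-ins); the shift(1) aligns the "before" window to exclude the current row as in the question:

```python
import pandas as pd

mean_cols = ["A", "B"]
df = pd.DataFrame({"A": range(50000), "B": range(50000)})

rm = df[mean_cols].rolling(5).mean()
rm_before = rm.shift(1)   # at row k: mean of rows k-5..k-1, like iloc[key-5:key]
rm_after = rm.shift(-4)   # at row k: mean of rows k..k+4, like iloc[key:key+5]

key_list = [10, 3, 1000]  # hypothetical out-of-order integer positions
rows_list = [
    {**{c + "_before": round(rm_before.at[k, c], 2) for c in mean_cols},
     **{c + "_after": round(rm_after.at[k, c], 2) for c in mean_cols}}
    for k in key_list
]
```

Keys near the edges (like 3 above) simply come back as NaN because their window is not full, matching the caveat mentioned in the answer.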
totalhack
  • From a quick scan of the docs it looks like I would have to specify a new rolling window for each index (as there isn't really any periodicity in my indices). I don't think that will necessarily result in better performance, but I will give it a try tomorrow, thanks. – Joram Feb 25 '20 at 20:23
  • If I understand your code correctly, it looks like you could just do a single rolling window of size 5 and take the mean. Then, assuming it's a sliding window that is not centered, you could iterate over that and get the mean of the last 5 items by accessing the current item of the windowed mean df, and the mean of the next five by looking 5 values ahead. – totalhack Feb 25 '20 at 20:29

Pandas' .mean() has a reputation for being slow.

One idea is to use numpy, converting via pandas' built-in .to_numpy(). If you want a column-wise mean, numpy's .mean() then needs an axis argument - otherwise it calculates the mean of all values in the array.

import pandas as pd
import numpy as np
import random

# from @totalhack
mean_cols = ["A", "B"]

df = pd.DataFrame({
    "A": range(0, 50000),
    "B": range(0, 50000)
})

key_list = random.sample(range(50000), k=50000)
# in case key_list holds row labels (index values) rather than integer
# positions, convert them to positions first, because a numpy array has
# no index labels. E.g. by:
# my_rownames = list(your_df_with_rownames.index)
# key_list = [my_rownames.index(k) for k in your_old_keylist]

df_mc = df[mean_cols].to_numpy()

rows_list = []

for key in key_list:

    means_after = df_mc[key:key+5].mean(axis=0)
    means_before = df_mc[key-5:key].mean(axis=0)
    row_dict = {}

    for col in mean_cols:
        row_dict[col + '_after'] = round(means_after[mean_cols.index(col)], 2)
        row_dict[col + '_before'] = round(means_before[mean_cols.index(col)], 2)

    rows_list.append(row_dict)

If the data frame contained only numeric values, converting the whole thing to a numpy array as early as possible would speed things up even more. However, I guess there are text or date columns in the data frame too, so the earliest point to convert is directly after subsetting mean_cols - that is where I put .to_numpy().
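To illustrate why subsetting first matters, here is a small sketch with a made-up frame that mixes numeric and text columns: converting the whole frame produces a generic object array, while converting only the numeric subset keeps a fast float array.

```python
import pandas as pd

# hypothetical frame mixing numeric and non-numeric columns
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [4.0, 5.0, 6.0],
    "label": ["x", "y", "z"],
})

arr_all = df.to_numpy()              # dtype object: the text column forces it
arr_num = df[["A", "B"]].to_numpy()  # dtype float64: numeric math stays fast

print(arr_num[0:2].mean(axis=0))     # column-wise mean of the first two rows
```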

Or use parallelization (spreading the work over several CPU cores):


df_mc = df[mean_cols].to_numpy()

def mean_after(key, np_array=df_mc):
    return list(np.round(np_array[key:key+5].mean(axis=0), 2))

def mean_before(key, np_array=df_mc):
    return list(np.round(np_array[key-5:key].mean(axis=0), 2))


import multiprocessing as mp

# guard the pool creation so this also works where processes are spawned
if __name__ == '__main__':
    pool = mp.Pool()
    afters = pool.map(mean_after, key_list)
    befores = pool.map(mean_before, key_list)
    pool.close()

# do you really need rows_list with a dictionary per row? Why not access
# the afters or befores directly, like this?

afters[row_idx][mean_cols.index(col)]
befores[row_idx][mean_cols.index(col)]
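If the per-row dictionaries are still wanted, the parallel results can be zipped back into the question's rows_list format. A sketch with stand-in values for afters and befores (one [A, B] pair per key, as pool.map would return them):

```python
mean_cols = ["A", "B"]

# stand-in results: one rounded [A, B] mean pair per key, in key order
afters = [[2.0, 20.0], [2.5, 25.0]]
befores = [[1.0, 10.0], [1.5, 15.0]]

rows_list = [
    {**{c + "_after": a[j] for j, c in enumerate(mean_cols)},
     **{c + "_before": b[j] for j, c in enumerate(mean_cols)}}
    for a, b in zip(afters, befores)
]
```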


Gwang-Jin Kim
  • Interesting. I see this question https://stackoverflow.com/questions/18701569/pandas-dataframe-mean-very-slow-how-can-i-calculate-means-of-columns-faster also indicates slow mean with some good answers. Can't try it today but will try tomorrow at work. Thanks! – Joram Feb 25 '20 at 20:27
  • @Joram welcome! It is something I use too - the trick of converting to numpy to speed up calculations on data frames. `numpy` is more or less C; that's why. But because of typing, the data frame must then contain only numeric values to convert it to np.arrays for numeric calculations. – Gwang-Jin Kim Feb 25 '20 at 20:28
  • So the suggestion is to do this within the loop? I don't think that would be very efficient. Have you tested the performance of this? – totalhack Feb 25 '20 at 22:45
  • I cannot test because you didn't give any minimal running example. – Gwang-Jin Kim Feb 25 '20 at 22:48
  • Python for-loops are not that slow. – Gwang-Jin Kim Feb 25 '20 at 22:54
  • 1
    Perhaps you could test the performance of this with the simple example dataframe in [my answer](https://stackoverflow.com/a/60402440/10682164). – totalhack Feb 26 '20 at 00:39
  • 1
    From a quick test on my laptop this took close to a second, vs 7 ms for using pandas builtin `rolling` feature as described in my answer. Without the parallelization it takes about 2 seconds. So it's somewhere in the 100-200x slower range. Let me know if you see something different. – totalhack Feb 26 '20 at 01:11
  • @totalhack you are right. `rolling` is much faster! And feels much more lightweight. – Gwang-Jin Kim Feb 26 '20 at 01:17
  • for-loop is faster than parallelization. – Gwang-Jin Kim Feb 26 '20 at 01:22
  • @totalhack however, what he wants is not just rolling - he wants, for each key, the 5 lines before and after ... So you also have to take a key_list and subset your df using the keys. – Gwang-Jin Kim Feb 26 '20 at 01:29
  • 1
    I don't think so -- while the question isn't perfectly clear, the code looks to be calculating a mean of the last 5 items and next 5 items, which is what I'm accomplishing using `rolling`. – totalhack Feb 26 '20 at 01:35
  • @totalhack as far as I understood it is not just rolling, but picking lines according to keys and in that position 5 rows before or after to calculate the column-means ... as far as I understood ... – Gwang-Jin Kim Feb 26 '20 at 01:42
  • 1
    Calculating the mean of the last/next N rows relative to a key is exactly what `rolling` does (using the chained `mean` call). I think you might be misunderstanding what `rolling` does, or what the OP is trying to do in their code. It appears to be a simple fixed window length mean computation. – totalhack Feb 26 '20 at 01:55
  • @totalhack Thanks for your explanation and also your simple example. - However I thought: the OP's code shows that he takes some keys from a dictionary - which might be row names or row indexes. Since dicts have no order, it is not ensured that they are in order. But your `.rolling(5)` would just go through every row in order and take the last 5 rows, or the current row + the next 4 following rows. I thought the OP wanted arbitrary order. Imagine, his `key_list` might be `[0, 10, 3, 1000, 4, ...]` - that is why I `random.sample()`d the indexes. – Gwang-Jin Kim Feb 26 '20 at 06:22
  • @Gwang-JinKim You are indeed correct in understanding my problem. The indices in `key_list` are pretty random (always increasing but not by the same amount). Probably still possible to use `rolling`, but you would use it on the whole dataframe and then jump to the indices to extract. Which does not sound like it would be more efficient, but @totalhack seems really convinced that it's super fast, so probably worth a try. – Joram Feb 26 '20 at 07:01
  • @Joram Ah, I see a way: use `.rolling().mean()` first, then build a dictionary of the form `{rowname_or_index: rolling_result}`, and then you can look the results up by your keys. In case of row indexes: `dct_before = {i: x for i, x in enumerate(rm_before.values)}`, and in case of row names: `dict_before = {nm: x for nm, x in zip(df.index, rm_before.values)}`; then in both cases: `result_column = [dict_before[key] for key in key_list]`. That will then be the speed-optimal solution! – Gwang-Jin Kim Feb 26 '20 at 07:15
  • @totalhack use my last comment to build your optimal answer! – Gwang-Jin Kim Feb 26 '20 at 07:18
  • FWIW, it doesn't matter that the keys are out of order in your example. You are still taking a 5 item window around each key when you get it. The result would be the same had you ordered the keys before entering the loop in your code, except that `rows_list` ends up in a predictable order instead of random (why would you want that anyway?). @Joram the "jump to indices" you are describing [will be very fast](https://stackoverflow.com/a/16629125/10682164). – totalhack Feb 26 '20 at 14:15
  • @totalhack - yes, I realized that (see my last two comments). Your method is then the fastest definitely! - I'm giving +1 to you! – Gwang-Jin Kim Feb 26 '20 at 14:23