How to update a group of rows values efficienly in a DataFrame? How to make this algorythm scalable?

Question

I have some efficiency problems with the algorythm I am going to explain by snippets:

First the df_fs is created. I create a random DataFrame here to make it work with the example

import pandas as pd
import numpy as np
import random as rd
import string

R = 2500    # ROWS / 2
C = 100     # COLUMNS
NPROF = 1   # NUMBER OF PROFILES, up to 6

STNNBR = 'STNNBR'
d = {}
for x in range(C):
    key = ''.join(rd.choices(string.ascii_uppercase + string.digits, k=10))
    nan_list = [np.nan] * R
    float_list = [float(1000 * rd.random()) for i in range(R)]
    l = nan_list + float_list
    rd.shuffle(l)
    d[key] = l

d[STNNBR] = [int(200 * rd.random()) for i in range(R*2)]

df_fs = pd.DataFrame(d)

The list cur_plot_cols indicates the name of columns we are going to work with:

pos_list = []
while len(pos_list) < 20:
    v = int(C * rd.random())
    if v not in pos_list:
        pos_list.append(v)
d_keys = list(d.keys())
cur_plot_cols = [d_keys[p] for p in pos_list]

The prof_df is a huge DataFrame that I initialize with many NaN values and many columns. The number of columns increases with cur_plot_cols and NFPROF:

tab_list = ['SALNTY', 'OXYGEN', 'NITRAT', 'PHSPHT', 'SILCAT', 'ALKALI', 'TCARBN', 'PH_TOT', 'CFC_11', 'CFC_12', 'CFC113', 'SF6']
compound_cols = []
for tab in tab_list:
    for col in cur_plot_cols:
        for n in range(NPROF):
            compound_cols.append('{}_{}_{}'.format(tab, col, n))

d_aux = {}
if compound_cols != []:
    d_aux = dict.fromkeys(compound_cols, [])
prof_df = pd.DataFrame(d_aux)  # init empty columns
prof_df['INDEX'] = df_fs.index.values
prof_df = prof_df.set_index(['INDEX'])

More variables that I need to make the example work:

plot_prof_invsbl_points = True
stt_order_reversed = [31]         # up to 6 elements
tabs_flags_plots = {
    'NITRAT': {                   # tab name
        'flag': 'NITRAT_FLAG_W',
    },
    'SALNTY': {
        'flag': 'SALNTY_FLAG_W',
    },
}
visible_flags = [3, 4, 5, 6]

Finally the problematic algorithm, the line labeled with FIXME is the main bottle neck

f = cur_plot_cols + [STNNBR]
df_fs = df_fs.filter(f)

for tab in tab_list:
    i = NPROF - 1
    for stt in stt_order_reversed:
        for col in cur_plot_cols:
            df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]
            if plot_prof_invsbl_points is False:      # this is never True in this example extracted from the original code
                if tab in tabs_flags_plots.keys():
                    flag = tabs_flags_plots[tab]['flag']
                    df_aux = df_aux[df_aux[flag].isin(visible_flags)]
            prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]  # FIXME: this is the main bottle neck
        i -= 1

Measurements

I have measured the time with the line_profile tool and this is the result:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31        13        114.0      8.8      0.0      for tab in tab_list:
    32        12        148.0     12.3      0.0          i = NPROF - 1
    37        24        267.0     11.1      0.0          for stt in stt_order_reversed:
    38       372      12430.0     33.4      0.0              for col in cur_plot_cols:
    39       360   12890156.0  35806.0     13.1                  df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]
    40       360      11687.0     32.5      0.0                  if plot_prof_invsbl_points is False:
    41                                                               flag = self.env.tabs_flags_plots[tab]['flag']
    42                                                               df_aux = df_aux[df_aux[flag].isin(self.env.visible_flags)]
    43       360   85075802.0 236321.7     86.3                  prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]
    44        12        201.0     16.8      0.0              i -= 1

Any suggestion to make these lines faster?

df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]

prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]

Notes

Actually the real index I am using into the DataFrames are hashes, that is to say, strings.

I have to update the prof_df DataFrame columns. But the columns names are made with the parameters [tab, col, i], I need to iterate over them to set the column I want to update in each iteration. Is there a way to iterate faster and update those columns? Any alternative?

Some time ago I used this expression to assign grouped values:

ml_df['xs{}'.format(n_plot)] = df_p.groupby('STNNBR').apply(lambda x: list(x[col_x_name]))

But I do not know how to apply the same method here because this time I need the column name and the i value on the left side of the assigment.

The algorythm takes 6 seconds to complete, too much.

Could you add an example dataframe and what your expected output looks like? https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples — Erfan, Apr 08 '19 at 18:30
I have added some more details @Erfan. Tomorrow I will add a more complete example if I have time. — ChesuCR, Apr 08 '19 at 18:43
Yes please, its easier to understand your problem and help you better. Please comment here when you added the more complete example. — Erfan, Apr 08 '19 at 18:56
Hi @Erfan I have added a complete example. I have identified the line that takes more time to be computed as well. Let me know if you need to know anything else. Thank you in advance — ChesuCR, Apr 10 '19 at 16:10
Hello @Ben.T !! You have helped me once in the past with [this other question](https://stackoverflow.com/a/50025959/4891717). Could you lend me a hand with this new algorithm? I will appreciate it. — ChesuCR, Apr 12 '19 at 14:50

score 0 · Answer 1 · answered Apr 08 '19 at 22:16

0

I'm still confused about your ask, but I think you should try looking at a loop using enumerate:

http://book.pythontips.com/en/latest/enumerate.html

This will allow you to use the i value as well as the column name.

answered Apr 08 '19 at 22:16

lots_of_dumb_questions

21
2

Thank for your answer. I have updated my question, the performance problem appears when I make assigments to many rows in the DataFrame – ChesuCR Apr 10 '19 at 16:18

score 0 · Accepted Answer · answered May 03 '19 at 18:50

Just with the following change the time decreased to half. But it is still not enough:

prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]
# >>
prof_df.assign(**{'{}_{}_{}'.format(tab, col, i): df_aux[col] })

Then I have extracted the assignment out of the loop and I organized the loops in a better way. The performance was improved 7 times.

df_fs = df_fs[df_fs[STNNBR].isin(stt_order_reversed)]
d_temp = {}
for tab in tab_list:
    i = NPROF - 1

    if tab in tabs_flags_plots.keys():
        flag = tabs_flags_plots[tab]['flag']
        df_fs = df_fs[df_fs[flag].isin(visible_flags)]

    for stt in stt_order_reversed:
        df_stt = df_tab[df_tab[STNNBR] == stt]
        for col in cur_plot_cols:
            df_aux = df_stt[col]
            d_temp['{}_{}_{}'.format(tab, col, i)] = df_aux
        i -= 1
prof_df = prof_df.assign(**d_temp)

Note: Now I want to make it parallel with numba or joblib, but maybe I will take a look at it in the future. Currently the result is good enough

How to update a group of rows values efficienly in a DataFrame? How to make this algorythm scalable?

Measurements

Notes

2 Answers2