I have some efficiency problems with the algorythm I am going to explain by snippets:
First the
df_fs
is created. I create a random DataFrame here to make it work with the exampleimport pandas as pd import numpy as np import random as rd import string R = 2500 # ROWS / 2 C = 100 # COLUMNS NPROF = 1 # NUMBER OF PROFILES, up to 6 STNNBR = 'STNNBR' d = {} for x in range(C): key = ''.join(rd.choices(string.ascii_uppercase + string.digits, k=10)) nan_list = [np.nan] * R float_list = [float(1000 * rd.random()) for i in range(R)] l = nan_list + float_list rd.shuffle(l) d[key] = l d[STNNBR] = [int(200 * rd.random()) for i in range(R*2)] df_fs = pd.DataFrame(d)
The list
cur_plot_cols
indicates the name of columns we are going to work with:pos_list = [] while len(pos_list) < 20: v = int(C * rd.random()) if v not in pos_list: pos_list.append(v) d_keys = list(d.keys()) cur_plot_cols = [d_keys[p] for p in pos_list]
The
prof_df
is a huge DataFrame that I initialize with manyNaN
values and many columns. The number of columns increases withcur_plot_cols
andNFPROF
:tab_list = ['SALNTY', 'OXYGEN', 'NITRAT', 'PHSPHT', 'SILCAT', 'ALKALI', 'TCARBN', 'PH_TOT', 'CFC_11', 'CFC_12', 'CFC113', 'SF6'] compound_cols = [] for tab in tab_list: for col in cur_plot_cols: for n in range(NPROF): compound_cols.append('{}_{}_{}'.format(tab, col, n)) d_aux = {} if compound_cols != []: d_aux = dict.fromkeys(compound_cols, []) prof_df = pd.DataFrame(d_aux) # init empty columns prof_df['INDEX'] = df_fs.index.values prof_df = prof_df.set_index(['INDEX'])
More variables that I need to make the example work:
plot_prof_invsbl_points = True stt_order_reversed = [31] # up to 6 elements tabs_flags_plots = { 'NITRAT': { # tab name 'flag': 'NITRAT_FLAG_W', }, 'SALNTY': { 'flag': 'SALNTY_FLAG_W', }, } visible_flags = [3, 4, 5, 6]
Finally the problematic algorithm, the line labeled with
FIXME
is the main bottle neckf = cur_plot_cols + [STNNBR] df_fs = df_fs.filter(f) for tab in tab_list: i = NPROF - 1 for stt in stt_order_reversed: for col in cur_plot_cols: df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()] if plot_prof_invsbl_points is False: # this is never True in this example extracted from the original code if tab in tabs_flags_plots.keys(): flag = tabs_flags_plots[tab]['flag'] df_aux = df_aux[df_aux[flag].isin(visible_flags)] prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col] # FIXME: this is the main bottle neck i -= 1
Measurements
I have measured the time with the line_profile
tool and this is the result:
Line # Hits Time Per Hit % Time Line Contents
==============================================================
31 13 114.0 8.8 0.0 for tab in tab_list:
32 12 148.0 12.3 0.0 i = NPROF - 1
37 24 267.0 11.1 0.0 for stt in stt_order_reversed:
38 372 12430.0 33.4 0.0 for col in cur_plot_cols:
39 360 12890156.0 35806.0 13.1 df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]
40 360 11687.0 32.5 0.0 if plot_prof_invsbl_points is False:
41 flag = self.env.tabs_flags_plots[tab]['flag']
42 df_aux = df_aux[df_aux[flag].isin(self.env.visible_flags)]
43 360 85075802.0 236321.7 86.3 prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]
44 12 201.0 16.8 0.0 i -= 1
Any suggestion to make these lines faster?
df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]
prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]
Notes
Actually the real index I am using into the DataFrames are hashes, that is to say, strings.
I have to update the prof_df
DataFrame columns. But the columns names are made with the parameters [tab, col, i]
, I need to iterate over them to set the column I want to update in each iteration. Is there a way to iterate faster and update those columns? Any alternative?
Some time ago I used this expression to assign grouped values:
ml_df['xs{}'.format(n_plot)] = df_p.groupby('STNNBR').apply(lambda x: list(x[col_x_name]))
But I do not know how to apply the same method here because this time I need the column name and the i value on the left side of the assigment.
The algorythm takes 6 seconds to complete, too much.