1

I have some efficiency problems with the algorythm I am going to explain by snippets:

  1. First the df_fs is created. I create a random DataFrame here to make it work with the example

    import pandas as pd
    import numpy as np
    import random as rd
    import string
    
    R = 2500    # ROWS / 2
    C = 100     # COLUMNS
    NPROF = 1   # NUMBER OF PROFILES, up to 6
    
    STNNBR = 'STNNBR'
    d = {}
    for x in range(C):
        key = ''.join(rd.choices(string.ascii_uppercase + string.digits, k=10))
        nan_list = [np.nan] * R
        float_list = [float(1000 * rd.random()) for i in range(R)]
        l = nan_list + float_list
        rd.shuffle(l)
        d[key] = l
    
    d[STNNBR] = [int(200 * rd.random()) for i in range(R*2)]
    
    df_fs = pd.DataFrame(d)
    
  2. The list cur_plot_cols indicates the name of columns we are going to work with:

    pos_list = []
    while len(pos_list) < 20:
        v = int(C * rd.random())
        if v not in pos_list:
            pos_list.append(v)
    d_keys = list(d.keys())
    cur_plot_cols = [d_keys[p] for p in pos_list]
    
  3. The prof_df is a huge DataFrame that I initialize with many NaN values and many columns. The number of columns increases with cur_plot_cols and NFPROF:

    tab_list = ['SALNTY', 'OXYGEN', 'NITRAT', 'PHSPHT', 'SILCAT', 'ALKALI', 'TCARBN', 'PH_TOT', 'CFC_11', 'CFC_12', 'CFC113', 'SF6']
    compound_cols = []
    for tab in tab_list:
        for col in cur_plot_cols:
            for n in range(NPROF):
                compound_cols.append('{}_{}_{}'.format(tab, col, n))
    
    d_aux = {}
    if compound_cols != []:
        d_aux = dict.fromkeys(compound_cols, [])
    prof_df = pd.DataFrame(d_aux)  # init empty columns
    prof_df['INDEX'] = df_fs.index.values
    prof_df = prof_df.set_index(['INDEX'])
    
  4. More variables that I need to make the example work:

    plot_prof_invsbl_points = True
    stt_order_reversed = [31]         # up to 6 elements
    tabs_flags_plots = {
        'NITRAT': {                   # tab name
            'flag': 'NITRAT_FLAG_W',
        },
        'SALNTY': {
            'flag': 'SALNTY_FLAG_W',
        },
    }
    visible_flags = [3, 4, 5, 6]
    
  5. Finally the problematic algorithm, the line labeled with FIXME is the main bottle neck

    f = cur_plot_cols + [STNNBR]
    df_fs = df_fs.filter(f)
    
    for tab in tab_list:
        i = NPROF - 1
        for stt in stt_order_reversed:
            for col in cur_plot_cols:
                df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]
                if plot_prof_invsbl_points is False:      # this is never True in this example extracted from the original code
                    if tab in tabs_flags_plots.keys():
                        flag = tabs_flags_plots[tab]['flag']
                        df_aux = df_aux[df_aux[flag].isin(visible_flags)]
                prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]  # FIXME: this is the main bottle neck
            i -= 1
    

Measurements

I have measured the time with the line_profile tool and this is the result:

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    31        13        114.0      8.8      0.0      for tab in tab_list:
    32        12        148.0     12.3      0.0          i = NPROF - 1
    37        24        267.0     11.1      0.0          for stt in stt_order_reversed:
    38       372      12430.0     33.4      0.0              for col in cur_plot_cols:
    39       360   12890156.0  35806.0     13.1                  df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]
    40       360      11687.0     32.5      0.0                  if plot_prof_invsbl_points is False:
    41                                                               flag = self.env.tabs_flags_plots[tab]['flag']
    42                                                               df_aux = df_aux[df_aux[flag].isin(self.env.visible_flags)]
    43       360   85075802.0 236321.7     86.3                  prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]
    44        12        201.0     16.8      0.0              i -= 1

Any suggestion to make these lines faster?

df_aux = df_fs[(df_fs[STNNBR] == stt) & df_fs[col].notnull()]

prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]

Notes

Actually the real index I am using into the DataFrames are hashes, that is to say, strings.

I have to update the prof_df DataFrame columns. But the columns names are made with the parameters [tab, col, i], I need to iterate over them to set the column I want to update in each iteration. Is there a way to iterate faster and update those columns? Any alternative?

Some time ago I used this expression to assign grouped values:

ml_df['xs{}'.format(n_plot)] = df_p.groupby('STNNBR').apply(lambda x: list(x[col_x_name]))

But I do not know how to apply the same method here because this time I need the column name and the i value on the left side of the assigment.

The algorythm takes 6 seconds to complete, too much.

ChesuCR
  • 9,352
  • 5
  • 51
  • 114
  • 1
    Could you add an example dataframe and what your expected output looks like? https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Erfan Apr 08 '19 at 18:30
  • I have added some more details @Erfan. Tomorrow I will add a more complete example if I have time. – ChesuCR Apr 08 '19 at 18:43
  • Yes please, its easier to understand your problem and help you better. Please comment here when you added the more complete example. – Erfan Apr 08 '19 at 18:56
  • Hi @Erfan I have added a complete example. I have identified the line that takes more time to be computed as well. Let me know if you need to know anything else. Thank you in advance – ChesuCR Apr 10 '19 at 16:10
  • Hello @Ben.T !! You have helped me once in the past with [this other question](https://stackoverflow.com/a/50025959/4891717). Could you lend me a hand with this new algorithm? I will appreciate it. – ChesuCR Apr 12 '19 at 14:50

2 Answers2

0

I'm still confused about your ask, but I think you should try looking at a loop using enumerate:

http://book.pythontips.com/en/latest/enumerate.html

This will allow you to use the i value as well as the column name.

  • Thank for your answer. I have updated my question, the performance problem appears when I make assigments to many rows in the DataFrame – ChesuCR Apr 10 '19 at 16:18
0

Just with the following change the time decreased to half. But it is still not enough:

prof_df.loc[df_aux.index.values, '{}_{}_{}'.format(tab, col, i)] = df_aux[col]
# >>
prof_df.assign(**{'{}_{}_{}'.format(tab, col, i): df_aux[col] })

Then I have extracted the assignment out of the loop and I organized the loops in a better way. The performance was improved 7 times.

df_fs = df_fs[df_fs[STNNBR].isin(stt_order_reversed)]
d_temp = {}
for tab in tab_list:
    i = NPROF - 1

    if tab in tabs_flags_plots.keys():
        flag = tabs_flags_plots[tab]['flag']
        df_fs = df_fs[df_fs[flag].isin(visible_flags)]

    for stt in stt_order_reversed:
        df_stt = df_tab[df_tab[STNNBR] == stt]
        for col in cur_plot_cols:
            df_aux = df_stt[col]
            d_temp['{}_{}_{}'.format(tab, col, i)] = df_aux
        i -= 1
prof_df = prof_df.assign(**d_temp)

Note: Now I want to make it parallel with numba or joblib, but maybe I will take a look at it in the future. Currently the result is good enough

ChesuCR
  • 9,352
  • 5
  • 51
  • 114