
I have a large dataframe that I need to loop through, but this takes a long time when the dataframe is very large. I know iterrows is quite slow and vectorization is much faster. However, I don't know how to rewrite an iterrows loop as a vectorized operation.

My dataframe is given as follows:

print(df_toe.head(10))

   z_toe  dn50_toe  Nod  ht/h  output_ok
0   -3.5  0.067171  NaN   NaN        1.0
1   -3.5  0.082472  NaN   NaN        1.0
2   -3.5  0.095543  NaN   NaN        1.0
3   -3.5  0.196341  NaN   NaN        1.0
4   -3.5  0.232024  NaN   NaN        1.0
5   -3.5  0.347270  NaN   NaN        1.0
6   -3.5  0.353661  NaN   NaN        1.0
7   -3.5  0.404841  NaN   NaN        1.0
8   -3.5  0.632502  NaN   NaN        1.0
9   -3.5  0.922923  NaN   NaN        1.0

With some extra parameters:

z_bed = -4.5 
swl = 1.8

The row-by-row loop through the dataframe df_toe is written as follows:

def dftoe_det_2nd(df_toe):

    for i in df_toe.index:
        # Define input variables
        z_toe = df_toe.at[i, 'z_toe']
        dn50_toe = df_toe.at[i, 'dn50_toe']

        # Define restrictions between which it can operate for z_toe/h
        h = swl - z_bed
        ht = swl - z_toe
        df_toe.at[i, 'ht/h'] = abs(ht / h)

        if z_toe < z_bed:
            df_toe.at[i, 'output_ok'] = 0

        # Show all water heights
        df_toe.at[i, 'Nod'] = Nodtoe()

        if 0.90 < abs(ht / h) or 0.4 > abs(ht / h):
            df_toe.at[i, 'output_ok'] = 0

        if h > 25:
            df_toe.at[i, 'output_ok'] = 0

    # Keep only the rows that passed every check
    df_toe = df_toe[df_toe['output_ok'] == 1]
    del df_toe['output_ok']
    return df_toe

Does anyone know how this can be optimized for speed and computation time?

Floris

1 Answer


You can follow https://stackoverflow.com/a/28490706/3528612 and try OpenMP over the loop. Alternatively, if you have the resources (i.e. more processors), you can try mpi4py and split the loop into small chunks that run in parallel to make this faster.
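Before reaching for parallelism, it may be worth checking whether the loop body can be expressed as column operations, since each row is handled independently. The following is only a sketch of that vectorized alternative, not the approach from the linked answer. It assumes that Nodtoe() (whose body is not shown in the question) returns the same value for every row, so a hypothetical stand-in nod_toe() is used here, and dftoe_det_vectorized is an illustrative name.

```python
import numpy as np
import pandas as pd

# Parameters taken from the question.
z_bed = -4.5
swl = 1.8


def nod_toe():
    # Hypothetical stand-in for the question's Nodtoe(); its real body is unknown.
    return np.nan


def dftoe_det_vectorized(df_toe, z_bed=z_bed, swl=swl):
    # Work on a copy so the caller's dataframe is left untouched
    # (the original loop modified it in place).
    df = df_toe.copy()

    h = swl - z_bed                          # scalar water depth
    ratio = ((swl - df["z_toe"]) / h).abs()  # |ht / h| for all rows at once
    df["ht/h"] = ratio

    # Assumption: the value is row-independent, so one call is broadcast.
    df["Nod"] = nod_toe()

    # Same rejection rules as the loop, evaluated on whole columns.
    bad = (df["z_toe"] < z_bed) | (ratio > 0.90) | (ratio < 0.4)
    bad = bad | (h > 25)

    keep = (df["output_ok"] == 1) & ~bad
    return df.loc[keep].drop(columns="output_ok")
```

Calling dftoe_det_vectorized(df_toe) should return the same filtered rows as the loop, as long as the assumption about Nodtoe() holds. If Nodtoe() really depends on per-row values, that one column would need its own vectorized expression or an apply call.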

Manmeet Singh