
I have a function that takes a pandas DataFrame with index labels in the form <int>_<int> (which denotes size ranges in µm) and columns that hold values for separate samples in those size ranges.

The size ranges are consecutive as in the following example:

df = pd.DataFrame({'A': ['a', 'd', 'g', 'j'], 'B': ['b', 'e', 'h', 'k'], 'C': ['c', 'f', 'i', 'l']}, index = ['0_10', '10_20', '20_30', '30_40'])

        A   B   C
0_10    a   b   c
10_20   d   e   f
20_30   g   h   i
30_40   j   k   l

Note: for demonstration purposes the values are letters here. The real values are float64 numbers.

Here is the code that I am using so far. The docstring shows what it is doing. It works as intended, but the nested loop and the iterative creation of new rows make it very slow: for a DataFrame with 200 rows and 21 columns it runs for about 2 minutes.

def combination_sums(df):  # TODO: speed up
    """
    Append new rows to a DF, where each new row is a column-wise sum of an original row
    and any possible combination of consecutively following rows. The input DF must have
    an index according to the scheme below.
    
    
    Example:
    
                INPUT  DF                             OUTPUT DF
                
             A        B        C                        A        B        C
             
      0_10   a        b        c               0_10     a        b        c   
     10_20   d        e        f     -->      10_20     d        e        f
     20_30   g        h        i              20_30     g        h        i
     30_40   j        k        l              30_40     j        k        l 
                                               0_20    a+d      b+e      c+f
                                               0_30   a+d+g    b+e+h    c+f+i
                                               0_40  a+d+g+j  b+e+h+k  c+f+i+l
                                              10_30    d+g      e+h      f+i
                                              10_40   d+g+j    e+h+k    f+i+l
                                              20_40    g+j      h+k      i+l
    """
    
    ol = len(df)  # original length
    
    for i in range(ol):
        for j in range(i + 1, ol):
            # Row label spans from the start of row i to the end of row j
            new_row_name = df.index[i].split('_')[0] + '_' + df.index[j].split('_')[1]
            df.loc[new_row_name] = df.iloc[i:j + 1].sum()  # i:j+1 so row j is included, per the docstring
            
    return df

I am wondering what a better way might be to make it more efficient, i.e. using an intermediate conversion to a numpy array and doing it in a vectorised operation. From somewhat similar posts (e.g. here), I thought there could be a way with numpy mgrid or ogrid, but they were not similar enough for me to adapt them to what I want to achieve.
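One vectorised sketch (not from the post; `combination_sums_fast` is a hypothetical name) exploits the fact that the sum of rows i..j is a difference of two cumulative sums, so all pairwise range sums can be computed at once with `np.triu_indices` instead of growing the DataFrame row by row:

```python
import numpy as np
import pandas as pd

def combination_sums_fast(df):
    """Vectorised sketch: same output scheme as combination_sums."""
    vals = df.to_numpy(dtype=float)
    n = len(df)
    # cs[k] = column-wise sum of rows 0..k-1 (leading zero row), so the
    # sum of rows i..j inclusive is cs[j + 1] - cs[i]
    cs = np.vstack([np.zeros(vals.shape[1]), vals.cumsum(axis=0)])
    # All pairs (i, j) with i < j, enumerated as flat index arrays
    i, j = np.triu_indices(n, k=1)
    sums = cs[j + 1] - cs[i]
    # Build the combined labels from the start of row i and the end of row j
    starts = [lbl.split('_')[0] for lbl in df.index]
    ends = [lbl.split('_')[1] for lbl in df.index]
    new_index = [f"{starts[a]}_{ends[b]}" for a, b in zip(i, j)]
    new_rows = pd.DataFrame(sums, index=new_index, columns=df.columns)
    return pd.concat([df, new_rows])
```

This replaces the per-row `df.loc[...] = ...` assignments (which reallocate the frame each time) with one array operation and a single `concat`, so for 200 rows it should take milliseconds rather than minutes.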

  • You should avoid using for loops to speed up anyway. – Alexey Aug 05 '21 at 12:24
  • I know. That's why I posted this question. When I wrote the loops it was not for speed but for achieving what I needed to do. If you need a more realistic DF example, you can read one from the csv I pasted here: https://pastebin.com/M6iu1CCA (valid for 2 weeks). This is actually quite close to what my real data looks like. – roble Aug 05 '21 at 13:05

0 Answers