I have a function that takes a pandas dataframe with index labels of the form <int>_<int>
(which basically denote size ranges in µm) and columns that hold values for separate samples in those size ranges.
The size ranges are consecutive as in the following example:
df = pd.DataFrame({'A': ['a', 'd', 'g', 'j'], 'B': ['b', 'e', 'h', 'k'], 'C': ['c', 'f', 'i', 'l']}, index = ['0_10', '10_20', '20_30', '30_40'])
       A  B  C
0_10   a  b  c
10_20  d  e  f
20_30  g  h  i
30_40  j  k  l
Note: for demonstration purposes the values are letters here. The real values are float64 numbers.
Here is the code I am using so far; the docstring shows what it does. It works as intended, but the nested loop and the iterative creation of new rows make it very slow: for a dataframe with 200 rows and 21 columns it runs for about 2 min.
def combination_sums(df):  # TODO: speed up
    """
    Append new rows to a DF, where each new row is the column-wise sum of an original row
    and any possible combination of consecutively following rows. The input DF must have
    an index according to the scheme below.

    Example:
        INPUT DF              OUTPUT DF
               A  B  C               A        B        C
        0_10   a  b  c        0_10   a        b        c
        10_20  d  e  f  -->   10_20  d        e        f
        20_30  g  h  i        20_30  g        h        i
        30_40  j  k  l        30_40  j        k        l
                              0_20   a+d      b+e      c+f
                              0_30   a+d+g    b+e+h    c+f+i
                              0_40   a+d+g+j  b+e+h+k  c+f+i+l
                              10_30  d+g      e+h      f+i
                              10_40  d+g+j    e+h+k    f+i+l
                              20_40  g+j      h+k      i+l
    """
    ol = len(df)  # original length
    for i in range(ol):
        for j in range(i + 1, ol):
            # build the new index label from the lower bound of the first row
            # and the upper bound of the last row in the sum
            new_row_name = df.index[i].split('_')[0] + '_' + df.index[j].split('_')[1]
            df.loc[new_row_name] = df.iloc[i:j + 1].sum()  # sum of rows i..j inclusive
    return df
I am wondering what a better, more efficient way could be, e.g. an intermediate conversion to a numpy array and a vectorised operation. From somewhat similar posts (e.g. here), I thought there could be a way with numpy mgrid or ogrid; however, those examples were not similar enough for me to adapt them to what I want to achieve.
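For illustration, the kind of vectorised approach I have in mind would look roughly like the sketch below, which replaces the nested loop with a single cumulative sum over the rows plus fancy indexing. Note this is only my rough attempt: the name combination_sums_fast and the cumsum trick are my own, not taken from those posts, and I have not verified that it is actually the fastest way.

import numpy as np
import pandas as pd

def combination_sums_fast(df):
    vals = df.to_numpy()
    # prepend a zero row so that csum[j + 1] - csum[i] is the sum of rows i..j
    csum = np.vstack([np.zeros((1, vals.shape[1])), vals.cumsum(axis=0)])
    lo = [s.split('_')[0] for s in df.index]  # lower bounds of the size ranges
    hi = [s.split('_')[1] for s in df.index]  # upper bounds of the size ranges
    # all pairs (i, j) with j > i, in the same order as the nested loop
    i_idx, j_idx = np.triu_indices(len(df), k=1)
    sums = csum[j_idx + 1] - csum[i_idx]
    names = [lo[i] + '_' + hi[j] for i, j in zip(i_idx, j_idx)]
    new_rows = pd.DataFrame(sums, index=names, columns=df.columns)
    return pd.concat([df, new_rows])

Is something along these lines the right direction, or is there a cleaner or faster way, e.g. with mgrid or ogrid?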