
I'm a biology student who is fairly new to Python, and I was hoping someone might be able to help with a problem I have yet to solve.

With some earlier code I have created a pandas dataframe that looks like the example below:

Distance    No. of values    Mean rSquared
       1              500              0.6
       2               80              0.3
       3               40              0.4
       4               30              0.2
       5               50              0.2
       6               30              0.1

I can provide my previous code to create this dataframe, but I didn't think it was particularly relevant.

I need to sum the 'No. of values' column until it reaches a value >= 100, and then combine those rows, taking the weighted average (weighted by 'No. of values') of the Distance and Mean rSquared values, as in the example below:

Mean Distance           No. of values       Mean rSquared
1                       500                 0.6
(80*2+40*3)/120         (80+40) = 120       (80*0.3+40*0.4)/120
(30*4+50*5+30*6)/110    (30+50+30) = 110    (30*0.2+50*0.2+30*0.1)/110

etc...

I know pandas has its .cumsum function, which I might be able to use in a for loop with an if statement that checks the upper limit and resets the sum back to 0 when it is greater than or equal to the upper limit. However, I haven't a clue how to average the adjacent columns. A rough sketch of the loop I have in mind is below.
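Roughly, this is what I mean (an untested sketch; the combining step is exactly the part I'm missing):

total = 0
for n in df["No. of values"]:
    total += n
    if total >= 100:
        # ...combine the rows seen so far here (this is where I'm stuck)...
        total = 0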

Any help would be appreciated!

– James

3 Answers


You can use the following code snippet to solve your problem:

import pandas as pd

# First, compute the weighted values
df.loc[:, "weighted_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "weighted_mean_rSquared"] = df["Mean rSquared"] * df["No. of values"]


min_threshold = 100
indexes = []
temp_sum = 0

# placeholder for the final result
final_df = pd.DataFrame()
columns = ["Distance", "No. of values", "Mean rSquared"]

# resetting the index to make 'df' usable in the following loop
df = df.reset_index(drop=True)

# main loop to check and compute the desired output
for index, row in df.iterrows():
    temp_sum += row["No. of values"]
    indexes.append(index)

    # once the running sum reaches 'min_threshold', do the weighted-average computation
    if temp_sum >= min_threshold:
        temp_distance = df.iloc[indexes]["weighted_distance"].sum() / temp_sum
        temp_mean_rSquared = df.iloc[indexes]["weighted_mean_rSquared"].sum() / temp_sum

        # create a temporary dataframe and concatenate it with 'final_df'
        temp_df = pd.DataFrame([[temp_distance, temp_sum, temp_mean_rSquared]], columns=columns)
        final_df = pd.concat([final_df, temp_df], ignore_index=True)

        # reset the running variables for the next group
        temp_sum = 0
        indexes = []
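For reference, here is a quick check against the example data from the question (the dataframe construction here is mine; run it before the snippet above):

import pandas as pd

df = pd.DataFrame(
    {"Distance": [1, 2, 3, 4, 5, 6],
     "No. of values": [500, 80, 40, 30, 50, 30],
     "Mean rSquared": [0.6, 0.3, 0.4, 0.2, 0.2, 0.1]}
)

After running the snippet, print(final_df) should show something like:

   Distance  No. of values  Mean rSquared
0  1.000000          500.0       0.600000
1  2.333333          120.0       0.333333
2  5.000000          110.0       0.172727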
– sujanay
  • Thanks so much for your response, I didn't think I was going to end up with an answer so I ended up stumbling around in Python and came up with a similar solution, as such I'm giving the answer to you, thanks again! Edit: Didn't initially finish the comment as I didn't know enter submitted the comment (d'oh) – James Nov 17 '20 at 12:42

NumPy has a function, numpy.frompyfunc. You can use it to compute a cumulative value that resets based on a threshold.

Here's how to implement it. With that, you can figure out the indexes where the cumulative value goes over the threshold, and use them to calculate the Mean Distance and Mean rSquared for the values in your original dataframe.

I also leveraged @sujanay's idea of calculating the weighted values first.

import pandas as pd
import numpy as np

c = ['Distance', 'No. of values', 'Mean rSquared']
d = [[1, 500, 0.6], [2, 80, 0.3], [3, 40, 0.4],
     [4, 30, 0.2], [5, 50, 0.2], [6, 30, 0.1]]

df = pd.DataFrame(d, columns=c)

# calculate the weighted distance and weighted mean rSquared first
df.loc[:, "w_distance"] = df["Distance"] * df["No. of values"]
df.loc[:, "w_mean_rSqrd"] = df["Mean rSquared"] * df["No. of values"]

# use numpy.frompyfunc to set up the threshold condition:
# keep accumulating while the running total is below 100, otherwise restart from b
sumvals = np.frompyfunc(lambda a, b: a + b if a < 100 else b, 2, 1)

# assign the restarting cumulative sum to 'cumvals'
df['cumvals'] = sumvals.accumulate(df['No. of values'], dtype=object)

# find all records whose cumulative value is >= 100
idx = df.index[df['cumvals'] >= 100].tolist()

# if the last row is not in idx, add it to the list
if (len(df) - 1) not in idx:
    idx += [len(df) - 1]

# iterate through idx for each set and calculate Mean Distance and Mean rSquared
i = 0
for j in idx:
    df.loc[j, 'Mean Distance'] = (df.iloc[i:j+1]["w_distance"].sum() / df.loc[j, 'cumvals']).round(2)
    df.loc[j, 'New Mean rSquared'] = (df.iloc[i:j+1]["w_mean_rSqrd"].sum() / df.loc[j, 'cumvals']).round(2)
    i = j + 1

print(df)

The output of this will be:

   Distance  No. of values  ...  Mean Distance  New Mean rSquared
0         1            500  ...           1.00               0.60
1         2             80  ...            NaN                NaN
2         3             40  ...           2.33               0.33
3         4             30  ...            NaN                NaN
4         5             50  ...            NaN                NaN
5         6             30  ...           5.00               0.17

If you want to extract only the records that are non-NaN, you can do:

final_df = df[df['Mean Distance'].notnull()]

This will result in:

   Distance  No. of values  ...  Mean Distance  New Mean rSquared
0         1            500  ...           1.00               0.60
2         3             40  ...           2.33               0.33
5         6             30  ...           5.00               0.17

This approach is based on BEN_YO's implementation of numpy.frompyfunc. The original SO post is "Restart cumsum and get index if cumsum more than value".

– Joe Ferndz

If you figure out the grouping first, pandas' groupby functionality will do a lot of the remaining work for you. A loop is appropriate for getting the grouping (unless somebody has a clever one-liner):

>>> groups = []
>>> group = 0
>>> cumsum = 0
>>> for n in df["No. of values"]:
...     if cumsum >= 100:
...         cumsum = 0
...         group = group + 1
...     cumsum = cumsum + n
...     groups.append(group)
>>>
>>> groups
[0, 1, 1, 2, 2, 2]

Before doing the grouped operations, you need to use the No. of values column to get the weighting in:

df[["Distance.", "Mean rSquared"]] = df[["Distance.", "Mean rSquared"]].multiply(df["No. of values"], axis=0)

Now get the sums like this:

>>> sums = df.groupby(groups)["No. of values"].sum()
>>> sums
0    500
1    120
2    110
Name: No. of values, dtype: int64

And finally the weighted group averages like this:

>>> df[["Distance.", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
   Distance.  Mean rSquared
0   1.000000       0.600000
1   2.333333       0.333333
2   5.000000       0.172727
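For completeness, the steps above can be put together into one self-contained script along these lines (the dataframe construction and the final column tacked onto the result are my own additions):

import pandas as pd

df = pd.DataFrame(
    {"Distance": [1, 2, 3, 4, 5, 6],
     "No. of values": [500, 80, 40, 30, 50, 30],
     "Mean rSquared": [0.6, 0.3, 0.4, 0.2, 0.2, 0.1]}
)

# build the group labels with a running sum that resets at the threshold
groups = []
group = 0
cumsum = 0
for n in df["No. of values"]:
    if cumsum >= 100:
        cumsum = 0
        group = group + 1
    cumsum = cumsum + n
    groups.append(group)

# weight the columns by No. of values, then take grouped weighted averages
df[["Distance", "Mean rSquared"]] = df[["Distance", "Mean rSquared"]].multiply(df["No. of values"], axis=0)
sums = df.groupby(groups)["No. of values"].sum()
result = df[["Distance", "Mean rSquared"]].groupby(groups).sum().div(sums, axis=0)
result["No. of values"] = sums
print(result)

This prints the same weighted averages as the table above, with the group sizes alongside.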
– edvardlindelof