3

I am trying to make a Python program that will calculate a result based on a formula, given factors and an input dataframe.

I have a number of cars (N_cars) on a given length of the road (l) and their average speed (v):

input_columns = ['l', 'N_cars', 'v']
input_data = [[3.5, 1000, 100], [5.7, 500, 110], 
              [10, 367, 110], [11.1, 1800, 95],
              [2.8, 960, 105], [4.7, 800, 120], 
              [10.4, 103, 111], [20.1, 1950, 115]]
        
input_df = pd.DataFrame(input_data, columns=input_columns)
input_df

      l  N_cars    v
0   3.5    1000  100
1   5.7     500  110
2  10.0     367  110
3  11.1    1800   95
4   2.8     960  105
5   4.7     800  120
6  10.4     103  111
7  20.1    1950  115

I also know the factors needed for the formula for each category of car, and I know the percentage of each category. I also have different options for each category (3 options that I have here are just an example, there are many more options).

factors_columns = ['category', 'category %', 'option', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
factors_data = [['A', 58, 'opt_1', 0.000011, 0.23521, 0.93847, 0.39458, 0.00817, 0.24566, 0.0010, 0],
                ['A', 58, 'opt_2', 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0],
                ['A', 58, 'opt_3', 0.000011, 0.23521, 0.93145, 0.39458, 0.00467, 0.24566, 0.0010, 0],
                ['B', 22, 'opt_1', 0.002452, 0.48327, 0.83773, 0.92852, 0.00871, 0.29568, 0.0009, 0.02],
                ['B', 22, 'opt_2', 0.002899, 0.49327, 0.83773, 0.92852, 0.00871, 0.30468, 0.0009, 0.02],
                ['B', 22, 'opt_3', 0.002452, 0.48327, 0.83773, 0.92852, 0.00771, 0.29568, 0.0119, 0.01],
                ['C', 17, 'opt_1', 0.082583, 0.39493, 0.02462, 0.82714, 0.00918, 0.28572, 0.0012, 0],
                ['C', 17, 'opt_2', 0.072587, 0.35493, 0.02852, 0.82723, 0.00912, 0.29572, 0.0018, 0], 
                ['C', 17, 'opt_3', 0.082583, 0.39493, 0.02852, 0.82714, 0.00962, 0.28572, 0.0012, 0.01], 
                ['D', 3, 'opt_1', 0.018327, 0.32342, 0.82529, 0.92752, 0.00988, 0.21958, 0.0016, 0],
                ['D', 3, 'opt_2', 0.014427, 0.32342, 0.82729, 0.92752, 0.00968, 0.22558, 0.0026, 0],
                ['D', 3, 'opt_3', 0.018327, 0.32342, 0.82729, 0.94452, 0.00988, 0.21258, 0.0016, 0]]
        
factors_df = pd.DataFrame(factors_data, columns=factors_columns)
factors_df

   category  category % option         a        b        c        d        e        f       g     h
0         A          58  opt_1  0.000011  0.23521  0.93847  0.39458  0.00817  0.24566  0.0010  0.00
1         A          58  opt_2  0.000011  0.23521  0.93145  0.39458  0.00467  0.24566  0.0010  0.00
2         A          58  opt_3  0.000011  0.23521  0.93145  0.39458  0.00467  0.24566  0.0010  0.00
3         B          22  opt_1  0.002452  0.48327  0.83773  0.92852  0.00871  0.29568  0.0009  0.02
4         B          22  opt_2  0.002899  0.49327  0.83773  0.92852  0.00871  0.30468  0.0009  0.02
5         B          22  opt_3  0.002452  0.48327  0.83773  0.92852  0.00771  0.29568  0.0119  0.01
6         C          17  opt_1  0.082583  0.39493  0.02462  0.82714  0.00918  0.28572  0.0012  0.00
7         C          17  opt_2  0.072587  0.35493  0.02852  0.82723  0.00912  0.29572  0.0018  0.00
8         C          17  opt_3  0.082583  0.39493  0.02852  0.82714  0.00962  0.28572  0.0012  0.01
9         D           3  opt_1  0.018327  0.32342  0.82529  0.92752  0.00988  0.21958  0.0016  0.00
10        D           3  opt_2  0.014427  0.32342  0.82729  0.92752  0.00968  0.22558  0.0026  0.00
11        D           3  opt_3  0.018327  0.32342  0.82729  0.94452  0.00988  0.21258  0.0016  0.00

For each option (opt_1, opt_2, opt_3), I have to calculate the result based on this formula (factors are taken from the factors table, but v is coming from the input table):

formula = ( (a*v*v) + (b*v) + c + (d/v) )  /  ( (e*v*v) + (f*v) + g) * (1 - h)
result = l * N_cars * formula

However, I have to take into account the percentage of each category of car. For each row of the input_df I have to perform the calculations three times, once for each of the three options. For example, for the index 0 of input_df, I have N_cars=1000, v=100 and l=3.5, the output should be something like this:

# for opt_1:
result = 3.5 * 1000 * ((58% of category A {formula for index 0 of factors_df}) +
                       (22% of category B {formula for index 3 of factors_df) +
                       (17% of category C {formula for index 6 of factors_df}) +
                       (3% of category D {formula for index 9 of factors_df}) )

# for opt_2:
result = 3.5 * 1000 * ((58% of category A {formula for index 1 of factors_df}) +
                       (22% of category B {formula for index 4 of factors_df) +
                       (17% of category C {formula for index 7 of factors_df}) +
                       (3% of category D {formula for index 10 of factors_df}) )

# for opt_3:
result = 3.5 * 1000 * ((58% of category A {formula for index 2 of factors_df}) +
                       (22% of category B {formula for index 5 of factors_df) +
                       (17% of category C {formula for index 8 of factors_df}) +
                       (3% of category D {formula for index 11 of factors_df}) )

So, as an output, for each of the rows in input_df, I should have three results, one for each of the three options.

I can do the calculation manually for each step, but what I am having troubles with is to make a loop that does it automatically for each input row and all 3 options and then passes to the next input row and so on until the last input row.

ddejohn
  • 8,775
  • 3
  • 17
  • 30
ctrl_z
  • 268
  • 1
  • 15
  • It's not clear what you mean by "I have to take into account the % of each categories". Percentage of *what*? The total number of rows in that category? – ddejohn Feb 06 '22 at 18:54
  • that percantages show how the cars are divided. So If I have 100 as N_cars, I know that 58 of them are of category A, 22 of category B, 17 of category C and 3 of category D – ctrl_z Feb 06 '22 at 18:56
  • I find the question very clear and well written. Don't understand the downvotes. Please comment reasons before downvoting. – Krishnakanth Allika Feb 06 '22 at 19:21
  • 1
    Does https://stackoverflow.com/questions/19798153/difference-between-map-applymap-and-apply-methods-in-pandas help? – Karl Knechtel Feb 07 '22 at 01:08
  • 1
    While I think OP could benefit from learning how to use `map`, `apply` etc, I think this is a complicated enough use-case to warrant keeping this question open, as it involves using `apply` across two different dataframes, one of which is grouped. – ddejohn Feb 07 '22 at 01:44
  • @ddejohn, I did it before and re-did it again :) – ctrl_z Feb 24 '22 at 09:45

3 Answers3

7

Solution

Not sure what your expected results are, but I believe this does what you're asking for:

def formula(g, *, l, N_cars, v):
    x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g)
    return N_cars * l * (x * g.pct / 100).sum()


groups = factors_df.rename(columns={"category %": "pct"}).groupby("option")
result = input_df.apply(lambda r: groups.apply(lambda g: formula(g, **r)), axis=1)

Output:

In [5]: input_df.join(result)
Out[5]:
      l  N_cars    v         opt_1         opt_2         opt_3
0   3.5    1000  100   5411.685077   5115.048256   5500.985916
1   5.7     500  110   4425.339734   4169.893681   4483.595803
2  10.0     367  110   5698.595376   5369.652565   5773.612841
3  11.1    1800   95  30820.717985  29180.106606  31384.785443
4   2.8     960  105   4165.270216   3930.726187   4226.877893
5   4.7     800  120   5860.057879   5506.509637   5919.496692
6  10.4     103  111   1663.960420   1567.455541   1685.339848
7  20.1    1950  115  60976.735053  57375.300546  61685.075902

Explanation

The first step is to group factors_df by option. Just to show what that looks like:

In [6]: groups.apply(print)
  category  pct option         a        b  ...        d        e        f       g     h
0        A   58  opt_1  0.000011  0.23521  ...  0.39458  0.00817  0.24566  0.0010  0.00
3        B   22  opt_1  0.002452  0.48327  ...  0.92852  0.00871  0.29568  0.0009  0.02
6        C   17  opt_1  0.082583  0.39493  ...  0.82714  0.00918  0.28572  0.0012  0.00
9        D    3  opt_1  0.018327  0.32342  ...  0.92752  0.00988  0.21958  0.0016  0.00

[4 rows x 11 columns]
   category  pct option         a        b  ...        d        e        f       g     h
1         A   58  opt_2  0.000011  0.23521  ...  0.39458  0.00467  0.24566  0.0010  0.00
4         B   22  opt_2  0.002899  0.49327  ...  0.92852  0.00871  0.30468  0.0009  0.02
7         C   17  opt_2  0.072587  0.35493  ...  0.82723  0.00912  0.29572  0.0018  0.00
10        D    3  opt_2  0.014427  0.32342  ...  0.92752  0.00968  0.22558  0.0026  0.00

[4 rows x 11 columns]
   category  pct option         a        b  ...        d        e        f       g     h
2         A   58  opt_3  0.000011  0.23521  ...  0.39458  0.00467  0.24566  0.0010  0.00
5         B   22  opt_3  0.002452  0.48327  ...  0.92852  0.00771  0.29568  0.0119  0.01
8         C   17  opt_3  0.082583  0.39493  ...  0.82714  0.00962  0.28572  0.0012  0.01
11        D    3  opt_3  0.018327  0.32342  ...  0.94452  0.00988  0.21258  0.0016  0.00

Note that I renamed the category % to pct. This isn't necessary, but made accessing that column in the formula() function a bit cleaner (g.pct vs g["category %"]).

The next step was to implement formula() in such a way as to accept a group from factors_df as an argument:

def formula(g, *, l, N_cars, v):
    x = (1 - g.h) * (g.a * v*v + g.b*v + g.c + g.d/v) / (g.e * v*v + g.f*v + g.g)
    return N_cars * l * (x * g.pct / 100).sum()

In the function signature, g is a group from factors_df, then the keyword-only arguments l, N_cars, and v, which will come from a single row of input_df at a time.

Each of the three groups shown above will be entered into the formula() function one at a time, in their entirety. For example, during one call to formula(), the g argument will hold all of this data:

  category  pct option         a        b  ...        d        e        f       g     h
0        A   58  opt_1  0.000011  0.23521  ...  0.39458  0.00817  0.24566  0.0010  0.00
3        B   22  opt_1  0.002452  0.48327  ...  0.92852  0.00871  0.29568  0.0009  0.02
6        C   17  opt_1  0.082583  0.39493  ...  0.82714  0.00918  0.28572  0.0012  0.00
9        D    3  opt_1  0.018327  0.32342  ...  0.92752  0.00988  0.21958  0.0016  0.00

When the formula uses something like g.e, it's accessing the entire e column, and is taking advantage of vectorization to perform the arithmetic calculations on the entire column at the same time. When the dust settles, x will be a Series where each item in the series will be the result of the formula for each of the four categories of car. Here's an example:

0    0.231242
3    0.619018
6    7.188941
9    1.792376

Notice the indices? Those correspond to category A, B, C, and D from factors_df, respectively.

From there, we need to call formula() on each row of input_df, using the axis argument of pd.DataFrame.apply():

input_df.apply(lambda r: groups.apply(lambda g: formula(g, **r)), axis=1)

The lambda r is an anonymous function object being passed to apply, being applied over axis 1, meaning that r will be a single row from input_df at a time, for example:

In [13]: input_df.apply(print, axis=1)
l            3.5
N_cars    1000.0
v          100.0
Name: 0, dtype: float64
.
.
.

Now, on each row-wise apply, we're also applying the formula() function on the groups groupby object with lambda g: formula(g, **r). The **r unpacks the row from input_df as keyword arguments, which helps to ensure that the values for v, l, and N_cars aren't misused in the formula (no need to worry about which order they're passed into the formula() function).

ddejohn
  • 8,775
  • 3
  • 17
  • 30
  • 1
    ddejohn, I already accepted this answer before and Icommented with a thank you note, but I don't see that comment anymore :( I re-did it now, and I wanted to thank you a lot for the detailed explanation, it really helped a lot to understand the process :) – ctrl_z Feb 24 '22 at 09:43
  • Cheers, glad I could help. – ddejohn Feb 24 '22 at 16:05
1

Here is the code I wrote. It's somewhat long but it works. May be you (or someone) can modify and make it shorter.

# Transforming factors_df
df = factors_df.pivot(columns=["category", "option"])
df.reset_index(inplace=True)
# Renaming column names for each combination of option and category
df.columns = [s3 + s2 + s1 for (s1, s2, s3) in df.columns.to_list()]
df.drop(columns=["index"], inplace=True)
# Flattening to a single row to be able to apply formula
df = pd.DataFrame(df.max()).T
# Merging input with transformed factors data
input_df["tmp"] = 1
df["tmp"] = 1
df = pd.merge(input_df, df, on="tmp", how="left")
df.drop("tmp", axis=1, inplace=True)

# Calculating values for opt_1 using the formula
df["opt_1_value"] = (
    df["l"]
    * df["N_cars"]
    * (
        (
            df["opt_1Acategory %"]
            / 100
            * (
                df["opt_1Aa"] * df["v"] * df["v"]
                + df["opt_1Ab"] * df["v"]
                + df["opt_1Ac"]
                + df["opt_1Ad"] / df["v"]
            )
            / (
                (
                    df["opt_1Ae"] * df["v"] * df["v"]
                    + df["opt_1Af"] * df["v"]
                    + df["opt_1Ag"]
                )
                * (1 - df["opt_1Ah"])
            )
        )
        + (
            df["opt_1Bcategory %"]
            / 100
            * (
                df["opt_1Ba"] * df["v"] * df["v"]
                + df["opt_1Bb"] * df["v"]
                + df["opt_1Bc"]
                + df["opt_1Bd"] / df["v"]
            )
            / (
                (
                    df["opt_1Be"] * df["v"] * df["v"]
                    + df["opt_1Bf"] * df["v"]
                    + df["opt_1Bg"]
                )
                * (1 - df["opt_1Bh"])
            )
        )
        + (
            df["opt_1Ccategory %"]
            / 100
            * (
                df["opt_1Ca"] * df["v"] * df["v"]
                + df["opt_1Cb"] * df["v"]
                + df["opt_1Cc"]
                + df["opt_1Cd"] / df["v"]
            )
            / (
                (
                    df["opt_1Ce"] * df["v"] * df["v"]
                    + df["opt_1Cf"] * df["v"]
                    + df["opt_1Cg"]
                )
                * (1 - df["opt_1Ch"])
            )
        )
    )
)

# Calculating values for opt_2 using the formula
df["opt_2_value"] = (
    df["l"]
    * df["N_cars"]
    * (
        (
            df["opt_2Acategory %"]
            / 100
            * (
                df["opt_2Aa"] * df["v"] * df["v"]
                + df["opt_2Ab"] * df["v"]
                + df["opt_2Ac"]
                + df["opt_2Ad"] / df["v"]
            )
            / (
                (
                    df["opt_2Ae"] * df["v"] * df["v"]
                    + df["opt_2Af"] * df["v"]
                    + df["opt_2Ag"]
                )
                * (1 - df["opt_2Ah"])
            )
        )
        + (
            df["opt_2Bcategory %"]
            / 100
            * (
                df["opt_2Ba"] * df["v"] * df["v"]
                + df["opt_2Bb"] * df["v"]
                + df["opt_2Bc"]
                + df["opt_2Bd"] / df["v"]
            )
            / (
                (
                    df["opt_2Be"] * df["v"] * df["v"]
                    + df["opt_2Bf"] * df["v"]
                    + df["opt_2Bg"]
                )
                * (1 - df["opt_2Bh"])
            )
        )
        + (
            df["opt_2Ccategory %"]
            / 100
            * (
                df["opt_2Ca"] * df["v"] * df["v"]
                + df["opt_2Cb"] * df["v"]
                + df["opt_2Cc"]
                + df["opt_2Cd"] / df["v"]
            )
            / (
                (
                    df["opt_2Ce"] * df["v"] * df["v"]
                    + df["opt_2Cf"] * df["v"]
                    + df["opt_2Cg"]
                )
                * (1 - df["opt_2Ch"])
            )
        )
    )
)

# Calculating values for opt_3 using the formula
df["opt_3_value"] = (
    df["l"]
    * df["N_cars"]
    * (
        (
            df["opt_3Acategory %"]
            / 100
            * (
                df["opt_3Aa"] * df["v"] * df["v"]
                + df["opt_3Ab"] * df["v"]
                + df["opt_3Ac"]
                + df["opt_3Ad"] / df["v"]
            )
            / (
                (
                    df["opt_3Ae"] * df["v"] * df["v"]
                    + df["opt_3Af"] * df["v"]
                    + df["opt_3Ag"]
                )
                * (1 - df["opt_3Ah"])
            )
        )
        + (
            df["opt_3Bcategory %"]
            / 100
            * (
                df["opt_3Ba"] * df["v"] * df["v"]
                + df["opt_3Bb"] * df["v"]
                + df["opt_3Bc"]
                + df["opt_3Bd"] / df["v"]
            )
            / (
                (
                    df["opt_3Be"] * df["v"] * df["v"]
                    + df["opt_3Bf"] * df["v"]
                    + df["opt_3Bg"]
                )
                * (1 - df["opt_3Bh"])
            )
        )
        + (
            df["opt_3Ccategory %"]
            / 100
            * (
                df["opt_3Ca"] * df["v"] * df["v"]
                + df["opt_3Cb"] * df["v"]
                + df["opt_3Cc"]
                + df["opt_3Cd"] / df["v"]
            )
            / (
                (
                    df["opt_3Ce"] * df["v"] * df["v"]
                    + df["opt_3Cf"] * df["v"]
                    + df["opt_3Cg"]
                )
                * (1 - df["opt_3Ch"])
            )
        )
    )
)

# Removing unnecessary columns
df_final = df[["l", "N_cars", "v", "opt_1_value", "opt_2_value", "opt_3_value"]]
print(df_final)

Output:

      l  N_cars    v  opt_1_value  opt_2_value  opt_3_value
0   3.5    1000  100  1496.002370  1420.656629  1534.748740
1   5.7     500  110   750.997279   710.944885   767.411691
2  10.0     367  110   551.157686   521.754019   562.906668
3  11.1    1800   95  2685.551348  2554.477141  2756.164589
4   2.8     960  105  1439.467965  1364.815604  1475.082027
5   4.7     800  120  1206.116125  1138.614075  1229.225287
6  10.4     103  111   154.744048   146.445615   157.990346
7  20.1    1950  115  2933.825622  2773.297776  2990.828374
ddejohn
  • 8,775
  • 3
  • 17
  • 30
  • 1
    thank you for the proposed solution, as I am looking for something automatic, short and simple this seems overly manual / step by step calculation. I hope you understand :) also `1 - h` is indeed the correct subtraction term, but that's easy to change – ctrl_z Feb 07 '22 at 11:43
  • @ctrl_z, I fully understand :-) I realized my code needs to be shortened but I didn't know how. – Krishnakanth Allika Feb 07 '22 at 13:43
  • 2
    @ctrl_z could you at least confirm if this is the expected result? Two solutions posted so far gave different results. – steven Feb 19 '22 at 15:00
  • 1
    @KrishnakanthAllika you get a different result because you used wrong brackets in the formula, you calculate `formula = ( (a*v*v) + (b*v) + c + (d/v) ) / (( (e*v*v) + (f*v) + g) * (1 - h))` instead of `formula = ( (a*v*v) + (b*v) + c + (d/v) ) / ( (e*v*v) + (f*v) + g) * (1 - h)` – ctrl_z Feb 24 '22 at 09:41
1

Another way to do it, not nearly as elegand as @ddejhon 's solution, tho:

def formula(input_index, factors_index):
    formula = ((factors_df.loc[factors_index,'a']*input_df['v'][input_index]**2)+
               (factors_df.loc[factors_index,'b']*input_df['v'][input_index])+
               (factors_df.loc[factors_index,'c'])+
               (factors_df.loc[factors_index,'d']/input_df['v'][input_index])
            )/(
               (factors_df.loc[factors_index,'e']*input_df['v'][input_index]**2)+
               (factors_df.loc[factors_index,'f']*input_df['v'][input_index])+
               (factors_df.loc[factors_index,'g'])
           )*(1-factors_df.loc[factors_index,'h'])
    return formula
index_list = [factors_df[factors_df['option'] == opt].index.tolist() for opt in factors_df['option'].unique().tolist()]

Edit1: got rid of that ugly nested for structure and replaced it with list comprehension

output_df            = pd.DataFrame(np.repeat(input_df.values, len(factors_df['option'].unique()), axis=0))
output_df.columns    = input_df.columns
output_df['option']  = factors_df['option'].unique().tolist()*len(input_df.index)
output_df['formula'] = [n for sub_list in [[sum(factors_df['category %'].unique()[k]/100 * formula(i,j[k]) 
                          for k in range(len(factors_df['category'].unique()))) 
                          for j in index_list] for i in input_df.index] for n in sub_list]
output_df['result']  = output_df['l'] * output_df['N_cars'] * output_df['formula']

Output:

output_df

       l    N_cars      v  option    formula          result
 0   3.5    1000.0  100.0   opt_1   1.546196     5411.685077
 1   3.5    1000.0  100.0   opt_2   1.461442     5115.048256
 2   3.5    1000.0  100.0   opt_3   1.571710     5500.985916
 3   5.7     500.0  110.0   opt_1   1.552751     4425.339734
 4   5.7     500.0  110.0   opt_2   1.463121     4169.893681
 5   5.7     500.0  110.0   opt_3   1.573192     4483.595803
 6  10.0     367.0  110.0   opt_1   1.552751     5698.595376
 7  10.0     367.0  110.0   opt_2   1.463121     5369.652565
 8  10.0     367.0  110.0   opt_3   1.573192     5773.612841
 9  11.1    1800.0   95.0   opt_1   1.542578    30820.717985
10  11.1    1800.0   95.0   opt_2   1.460466    29180.106606
11  11.1    1800.0   95.0   opt_3   1.570810    31384.785443
12   2.8     960.0  105.0   opt_1   1.549580     4165.270216
13   2.8     960.0  105.0   opt_2   1.462324     3930.726187
14   2.8     960.0  105.0   opt_3   1.572499     4226.877893
15   4.7     800.0  120.0   opt_1   1.558526     5860.057879
16   4.7     800.0  120.0   opt_2   1.464497     5506.509637
17   4.7     800.0  120.0   opt_3   1.574334     5919.496692
18  10.4     103.0  111.0   opt_1   1.553361     1663.960420
19  10.4     103.0  111.0   opt_2   1.463271     1567.455541
20  10.4     103.0  111.0   opt_3   1.573319     1685.339848
21  20.1    1950.0  115.0   opt_1   1.555727    60976.735053
22  20.1    1950.0  115.0   opt_2   1.463842    57375.300546
23  20.1    1950.0  115.0   opt_3   1.573800    61685.075902
P. Pinho
  • 394
  • 1
  • 6