4

I work in a logistics company and we do B2C deliveries for our client. So we have a rate card in a form of a table and list of deliveries/ transaction, the weight of the package and the location where it was delivered.

I have seen a lot of SUMIFS question being answered here but is very different from the one I need.

1st transaction dataframe contains transaction id, weight, island, category

transaction   weight   island   category
1             0.3      luzon    regular
2             0.5      visayas  express
3             0.5      luzon    express
4             0.4      visayas  regular
5             1.7      visayas  regular
6             1.5      luzon    regular

2nd rate card dataframe contains category, min_weight,max_weight, fee

category    island  min weight  max weight  fee
regular     luzon     0            0.5       30
regular     luzon     0.51         3.0       40
express     luzon     0            3.0       45
regular     visayas   0            0.5       50
regular     visayas   0.51         3.0       60
express     visayas   0            3.0       65

So i want to calculate for the fee base on the weight of the package and the location. the resulting transaction dataframe should be

transaction      weight      island    category       fee
1                 0.3        luzon      regular        30
2                 0.5       visayas     express        65
3                 0.5        luzon      express        45
4                 0.4       visayas     regular        50
5                 1.7       visayas     regular        60
6                 1.5        luzon      regular        40

So here's the formula in EXCEL on how fees are calculated

=SUMIFS(rate_card.fee, rate_card.min_weight <= transaction.weight, rate_card.max_weight >= transaction.weight, rate_card.island = transaction.island, rate_card.category = transaction.category)

So i want to replicate this particular formula in Python using Pandas

Hopefully someone can provide a solution on my 1 month problem.

2 Answers2

2

This is merge on category and island and then query that weight from the transaction dataframe is between min weight and max weight:

new = transaction.merge(rate_card, on=['category', 'island'])\
                 .query('weight.between(`min weight`, `max weight`)')\
                 .sort_values('transaction')\
                 .drop(['min weight', 'max weight'], axis=1)

Or if your pandas < 0.25.0, it does not support the backtick column selection in query yet, then use:

new = transaction.merge(rate_card, on=['category', 'island'])

new = new.loc[new['weight'].between(new['min weight'], new['max weight'])]\
         .sort_values('transaction')\
         .drop(['min weight', 'max weight'], axis=1)

Output

   transaction  weight   island category  fee
0            1     0.3    luzon  regular   30
4            2     0.5  visayas  express   65
5            3     0.5    luzon  express   45
6            4     0.4  visayas  regular   50
9            5     1.7  visayas  regular   60
3            6     1.5    luzon  regular   40

Details: The first merge gives us:

transaction.merge(rate_card, on=['category', 'island'])

   transaction  weight   island category  min weight  max weight  fee
0            1     0.3    luzon  regular        0.00         0.5   30
1            1     0.3    luzon  regular        0.51         3.0   40
2            6     1.5    luzon  regular        0.00         0.5   30
3            6     1.5    luzon  regular        0.51         3.0   40
4            2     0.5  visayas  express        0.00         3.0   65
5            3     0.5    luzon  express        0.00         3.0   45
6            4     0.4  visayas  regular        0.00         0.5   50
7            4     0.4  visayas  regular        0.51         3.0   60
8            5     1.7  visayas  regular        0.00         0.5   50
9            5     1.7  visayas  regular        0.51         3.0   60

Then we filter all the rows where weight = between min weight, max weight:

new = transaction.merge(rate_card, on=['category', 'island'])\
                 .query('weight.between(`min weight`, `max weight`)')

   transaction  weight   island category  min weight  max weight  fee
0            1     0.3    luzon  regular        0.00         0.5   30
3            6     1.5    luzon  regular        0.51         3.0   40
4            2     0.5  visayas  express        0.00         3.0   65
5            3     0.5    luzon  express        0.00         3.0   45
6            4     0.4  visayas  regular        0.00         0.5   50
9            5     1.7  visayas  regular        0.51         3.0   60

The last two steps are to sort correctly and to drop unnecessary columns

Erfan
  • 40,971
  • 8
  • 66
  • 78
  • Hi @Erfan thank you for taking time on answering this question. I did run your solution line by line, the first line `transaction.merge(rate_card, on=['category', 'island'])` works fine, but i get an error with this line `.query('weight.between(`min weight`, `max weight`)')`. – lester paja Oct 20 '19 at 11:06
  • Yes that probably because you have `pandas < 0.25.0`, you can check with `print(pd.__version__)`. Please check the second solution I provided, that one should work since it does not use `.query` and it will be many times faster than the other answer you accepted as correct since these are all vectorized methods. @lesterpaja – Erfan Oct 20 '19 at 11:13
  • the second solution works! Thank you so much! I'll time both solutions with 100k transactions to see which is faster. – lester paja Oct 20 '19 at 11:38
  • Cool, could you let me know, curious as well.@lesterpaja – Erfan Oct 20 '19 at 11:46
1

Note: Not a recommended solution (performance concerns), might be more useful for creating tests to make sure a better version using merge works as expected...


With the following preparation:

import pandas as pd
from io import StringIO

transaction = pd.read_csv(StringIO("""
transaction   weight   island   category
1             0.3      luzon    regular
2             0.5      visayas  express
3             0.5      luzon    express
4             0.4      visayas  regular
5             1.7      visayas  regular
6             1.5      luzon    regular
"""), sep=r"\s+")

rate = pd.read_csv(StringIO("""
category    island  min_weight  max_weight  fee
regular     luzon     0            0.5       30
regular     luzon     0.51         3.0       40
express     luzon     0            3.0       45
regular     visayas   0            0.5       50
regular     visayas   0.51         3.0       60
express     visayas   0            3.0       65
"""), sep=r"\s+")

We can create a Pandas expression to calculate the sum of rates for the 1st transaction:

rate[(rate.min_weight <= 0.3) &
     (rate.max_weight >= 0.3) &
     (rate.island=="luzon") &
     (rate.category=="regular")].fee.sum()

And use that expression in a list comprehension for each row, then assign it to a new column like this:

transaction["fee"] = [
    rate[(rate.min_weight <= t.weight) &
         (rate.max_weight >= t.weight) &
         (rate.island == t.island) &
         (rate.category == t.category)].fee.sum()
    for t in transaction.itertuples()
]
Aprillion
  • 21,510
  • 5
  • 55
  • 89
  • Hi I've tried the given solution and it worked!! thanks a lot! may i know this particular line do: `for _, t in transaction.iterrows()` – lester paja Oct 20 '19 at 10:48
  • [df.iterrows()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html) returns an iterable of pairs `(index, data)` and I am using [sequence unpacking](https://docs.python.org/3/tutorial/datastructures.html?highlight=sequence%20unpacking#tuples-and-sequences) to assign the pair to 2 variables `_, t`, where the first one is just an underscore used as a convention for a variable name that won't be used for anything – Aprillion Oct 20 '19 at 12:06
  • after reading the above documentation, I simplified my code to use [`df.itertuples()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.itertuples.html#pandas.DataFrame.itertuples) – Aprillion Oct 20 '19 at 12:08
  • 1
    This answer works, but iterating over a pandas dataframe is one of the things you should avoid, since we have vectorized solutions in `pandas` and `numpy`. [Here's](https://stackoverflow.com/questions/16476924/how-to-iterate-over-rows-in-a-dataframe-in-pandas/55557758#55557758) a good read on it – Erfan Oct 20 '19 at 12:40