1

I have a dataframe similar to this one:

data = {
    "index": [0, 1, 2, 3, 4, 5, 6, 7, 8],
    "batchId": [100, 100, 100, 100, 100, 200, 200, 200, 200],
    "amount": [50, 100, 150, 200, 250, 300, 350, 400, 450],
    "date": [
        "2020-11-24 15:00:00",
        "2020-11-24 15:00:00",
        "2020-11-25 15:00:00",
        "2020-11-25 15:00:00",
        "2020-11-25 15:00:00",
        "2020-11-25 15:00:00",
        "2020-11-25 15:00:00",
        "2020-11-25 15:00:00",
        "2020-11-26 15:00:00",
    ],
}
data_frame = pd.DataFrame(data)  

   index  batchId  amount                 date
0      0      100      50  2020-11-24 15:00:00
1      1      100     100  2020-11-24 15:00:00
2      2      100     150  2020-11-25 15:00:00
3      3      100     200  2020-11-25 15:00:00
4      4      100     250  2020-11-25 15:00:00
5      5      200     300  2020-11-25 15:00:00
6      6      200     350  2020-11-25 15:00:00
7      7      200     400  2020-11-25 15:00:00
8      8      200     450  2020-11-26 15:00:00

I'm trying to group by the batchId column, then sum by the amount column and also add the latest date of the summed batch.

So far I have only managed to sum up the amounts based on the batchId:

batch_amounts = (
    data_frame["amount"].groupby(data_frame["batchId"]).sum().reset_index()
)

Which gives:

   batchId  amount
0      100     750
1      200    1500

How can I also add the date column that will show the latest date of each batch?

   batchId  amount                 date
0      100     750  2020-11-25 15:00:00
1      200    1500  2020-11-26 15:00:00
Valip
  • 4,440
  • 19
  • 79
  • 150

1 Answers1

1

You could use agg:

data_frame["date"] = pd.to_datetime(data_frame["date"])
result = data_frame.groupby(["batchId"], as_index=False).agg(
    {"amount": "sum", "date": "max"}
)

print(result)

Output

   batchId  amount                 date
0      100     750  2020-11-25 15:00:00
1      200    1500  2020-11-26 15:00:00
Ch3steR
  • 20,090
  • 4
  • 28
  • 58
Dani Mesejo
  • 61,499
  • 6
  • 49
  • 76