1

I have a dataframe with 19M rows of different customers (~10K customers) and for their daily consumption over different date ranges. I have resampled this data into weekly consumption and the resulted dataframe is 2M rows. I want to know the ranges of consecutive dates for each customer and select those with the max(range). Any ideas? Thank you!

dogo
  • 73
  • 1
  • 4

1 Answers1

0

It would be great if you could post some example code, so the replies will be more specific.

You probably want to do something like earliest = df.groupby('Customer_ID').min()['Consumption_date'] to get the earliest consumption date per customer, and latest = df.groupby('Customer_ID').max()['Consumption_date'] for the latest consumption date, and then take the difference time_span = latest-earliest to get the time span per customer.

Knowing the specific df and variable names would be great

KingOtto
  • 840
  • 5
  • 18
  • hello and thanks for your relpy! In this way i can get the range of the data for each customer, but i dont know if the dates betwenn the min and max date (as specified by your code) will be consecutive! – dogo Feb 16 '22 at 15:56
  • they will not be consecutive - simply the min and the max... Maybe you can update your question a bit, add a data example, and show the desired output. This way it will be much simpler to advise. A general piece of advice: `pd.diff()` allows you to take differences.. I imagine you need some form of grouping by customer, then sorting by date, then taking diff, then selecting all with diff == 1 (consecutive), then do something with it. but again, your question is not 100% clear to me – KingOtto Feb 16 '22 at 16:29