I have a dataframe with 19M rows of different customers (~10K customers) and for their daily consumption over different date ranges. I have resampled this data into weekly consumption and the resulted dataframe is 2M rows. I want to know the ranges of consecutive dates for each customer and select those with the max(range). Any ideas? Thank you!
Asked
Active
Viewed 51 times
1 Answers
0
It would be great if you could post some example code, so the replies will be more specific.
You probably want to do something like earliest = df.groupby('Customer_ID').min()['Consumption_date']
to get the earliest consumption date per customer, and latest = df.groupby('Customer_ID').max()['Consumption_date']
for the latest consumption date, and then take the difference time_span = latest-earliest
to get the time span per customer.
Knowing the specific df and variable names would be great

KingOtto
- 840
- 5
- 18
-
hello and thanks for your relpy! In this way i can get the range of the data for each customer, but i dont know if the dates betwenn the min and max date (as specified by your code) will be consecutive! – dogo Feb 16 '22 at 15:56
-
they will not be consecutive - simply the min and the max... Maybe you can update your question a bit, add a data example, and show the desired output. This way it will be much simpler to advise. A general piece of advice: `pd.diff()` allows you to take differences.. I imagine you need some form of grouping by customer, then sorting by date, then taking diff, then selecting all with diff == 1 (consecutive), then do something with it. but again, your question is not 100% clear to me – KingOtto Feb 16 '22 at 16:29