4

I have a dataframe, below:

ID Date     Volume Sales
1  2020-02   10     4
1  2020-03   8      6
2  2020-02   6      8
2  2020-03   4      10

Is there an easy way to convert this to weekly data using resampling? And dividing the volume and sales column by the number of weeks in the month?

I have started my process which code which looks like:

import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('date')
grouped = df.groupby('ID').resmaple('W').ffill().reset_index() 
print(grouped)

After this step, I get an error message: cannot inset ID, already exists

Also is there a code to use for finding the number of weeks in a month for dividing the volume and sales column by the number of weeks in the month.

The Expected output is :

ID      Volume  Sales      Weeks
0   1      2.5    1.0     2020-02-02
0   1      2.5    1.0     2020-02-09
0   1      2.5    1.0     2020-02-16
0   1      2.5    1.0     2020-02-23
1   1      1.6    1.2     2020-03-01
1   1      1.6    1.2     2020-03-08
1   1      1.6    1.2     2020-03-15
1   1      1.6    1.2     2020-03-22
1   1      1.6    1.2     2020-03-29
2   2      1.5    2       2020-02-02
2   2      1.5    2       2020-02-09
2   2      1.5    2       2020-02-16
2   2      1.5    2       2020-02-23
3   2      0.8    2       2020-03-01
3   2      0.8    2       2020-03-08
3   2      0.8    2       2020-03-15
3   2      0.8    2       2020-03-22
3   2      0.8    2       2020-03-29
SeaBean
  • 22,547
  • 3
  • 13
  • 25
tj judge
  • 626
  • 3
  • 17
  • Please don't spam other tags, I see no connection to the R programming language ([tag:r] removed). – r2evans Mar 30 '21 at 15:59
  • This conversion can definitely be handled on R.. no ? – tj judge Mar 30 '21 at 16:05
  • Yes! By that argument, many things that can be done in both R and Python should be co-tagged, which will create (imho) too much noise in topics where some people speak either R or Python but not both. If you want solutions in either of the languages, I suggest you explicitly state that in the question. The Stack tag-recommendation system is good but imperfect, so often seemingly-unrelated tags are included in questions. – r2evans Mar 30 '21 at 16:11
  • 1
    Got it, thanks for the note! – tj judge Mar 30 '21 at 16:14
  • 1
    @tj judge See my edit to add a new simpler solution below. I also discussed its differences from other similar question requiring resampling from monthly to weekly data and splitting values into equal proportions for the generated weeks. – SeaBean Apr 01 '21 at 13:15
  • 1
    @SeaBean thank you very much for your in depth response to this – tj judge Apr 01 '21 at 15:27

1 Answers1

5

After review, a much simpler solution can be used. Please refer to subsection labeled New Solution in Part 1 below.

This task requires multiple steps. Let's break it down as follows:

Part 1: Transform Date & Resample

New Solution

With consideration that the weekly frequency required, being Sunday based (i.e. freq='W-SUN') is independent for each month and is not related to or affected by any adjacent month(s), we can directly use the year-month values in column Date to generate date ranges in weekly basis in one step rather than breaking into 2 steps by first generating daily date ranges from year-month and then resample the daily date ranges to weekly afterwards.

The new program logics just needs to use pd.date_range() with freq='W' with the help of pd.offsets.MonthEnd() to generate weekly frequency for a month. Altogether, it does not need to call .resample() or .asfreq() like other solutions. Effectively, the pd.date_range() with freq='W' is doing the resampling task for us.

Here goes the codes:

df['Weeks'] = df['Date'].map(lambda x: 
                             pd.date_range(
                                 start=pd.to_datetime(x), 
                                 end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
                                 freq='W'))

df = df.explode('Weeks')

Result:

print(df)


   ID     Date  Volume  Sales      Weeks
0   1  2020-02      10      4 2020-02-02
0   1  2020-02      10      4 2020-02-09
0   1  2020-02      10      4 2020-02-16
0   1  2020-02      10      4 2020-02-23
1   1  2020-03       8      6 2020-03-01
1   1  2020-03       8      6 2020-03-08
1   1  2020-03       8      6 2020-03-15
1   1  2020-03       8      6 2020-03-22
1   1  2020-03       8      6 2020-03-29
2   2  2020-02       6      8 2020-02-02
2   2  2020-02       6      8 2020-02-09
2   2  2020-02       6      8 2020-02-16
2   2  2020-02       6      8 2020-02-23
3   2  2020-03       4     10 2020-03-01
3   2  2020-03       4     10 2020-03-08
3   2  2020-03       4     10 2020-03-15
3   2  2020-03       4     10 2020-03-22
3   2  2020-03       4     10 2020-03-29

By the 2 lines of codes above, we already get the required result for Part 1. We don't need to go through the complicated codes of .groupby() and .resample() in the old solution.

We can continue to go to Part 2. As we have not created the grouped object, we can either replace grouped by df in for the codes in Part 2 or add a new line grouped = df to continue.

Old Solution

We use pd.date_range() with freq='D' with the help of pd.offsets.MonthEnd() to produce daily entries for the full month. Then transform these full month ranges to index before resampling to week frequency. Resampled with closed='left' to exclude the unwanted week of 2020-04-05 produced under default resample() parameters.

df['Weeks'] = df['Date'].map(lambda x: 
                             pd.date_range(
                                 start=pd.to_datetime(x), 
                                 end=(pd.to_datetime(x) + pd.offsets.MonthEnd()),
                                 freq='D'))

df = df.explode('Weeks').set_index('Weeks')

grouped = (df.groupby(['ID', 'Date'], as_index=False)
             .resample('W', closed='left')
             .ffill().dropna().reset_index(-1))

Result:

print(grouped)


       Weeks   ID     Date  Volume  Sales
0 2020-02-02  1.0  2020-02    10.0    4.0
0 2020-02-09  1.0  2020-02    10.0    4.0
0 2020-02-16  1.0  2020-02    10.0    4.0
0 2020-02-23  1.0  2020-02    10.0    4.0
1 2020-03-01  1.0  2020-03     8.0    6.0
1 2020-03-08  1.0  2020-03     8.0    6.0
1 2020-03-15  1.0  2020-03     8.0    6.0
1 2020-03-22  1.0  2020-03     8.0    6.0
1 2020-03-29  1.0  2020-03     8.0    6.0
2 2020-02-02  2.0  2020-02     6.0    8.0
2 2020-02-09  2.0  2020-02     6.0    8.0
2 2020-02-16  2.0  2020-02     6.0    8.0
2 2020-02-23  2.0  2020-02     6.0    8.0
3 2020-03-01  2.0  2020-03     4.0   10.0
3 2020-03-08  2.0  2020-03     4.0   10.0
3 2020-03-15  2.0  2020-03     4.0   10.0
3 2020-03-22  2.0  2020-03     4.0   10.0
3 2020-03-29  2.0  2020-03     4.0   10.0

Here, we retain the column Date for some use later.

Part 2: Divide Volume and Sales by number of weeks in month

Here, the number of weeks in month used to divide the Volume and Sales figures should actually be the number of resampled weeks within the month as shown in the interim result above.

If we use the actual number of weeks, then for Feb 2020, because of leap year, it has 29 days in that month and thus it actually spans across 5 weeks instead of the 4 resampled weeks in the interim result above. Then it would cause inconsistent results because there are only 4 week entries above while we divide each Volume and Sales figure by 5.

Let's go to the codes then:

We group by columns ID and Date and then divide each value in columns Volume and Sales by group size (i.e. number of resampled weeks).

grouped[['Volume', 'Sales']] = (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                       .transform(lambda x: x / x.count()))

or simplified form using /= as follows:

grouped[['Volume', 'Sales']] /= (grouped.groupby(['ID', 'Date'])[['Volume', 'Sales']]
                                        .transform('count'))

Result:

print(grouped)


       Weeks   ID     Date  Volume  Sales
0 2020-02-02  1.0  2020-02     2.5    1.0
0 2020-02-09  1.0  2020-02     2.5    1.0
0 2020-02-16  1.0  2020-02     2.5    1.0
0 2020-02-23  1.0  2020-02     2.5    1.0
1 2020-03-01  1.0  2020-03     1.6    1.2
1 2020-03-08  1.0  2020-03     1.6    1.2
1 2020-03-15  1.0  2020-03     1.6    1.2
1 2020-03-22  1.0  2020-03     1.6    1.2
1 2020-03-29  1.0  2020-03     1.6    1.2
2 2020-02-02  2.0  2020-02     1.5    2.0
2 2020-02-09  2.0  2020-02     1.5    2.0
2 2020-02-16  2.0  2020-02     1.5    2.0
2 2020-02-23  2.0  2020-02     1.5    2.0
3 2020-03-01  2.0  2020-03     0.8    2.0
3 2020-03-08  2.0  2020-03     0.8    2.0
3 2020-03-15  2.0  2020-03     0.8    2.0
3 2020-03-22  2.0  2020-03     0.8    2.0
3 2020-03-29  2.0  2020-03     0.8    2.0

Optionally, you can do some cosmetic works to drop the column Date and rearrange column Weeks to your desired position if you like.

Edit: (Similarity and difference from other questions resampling from month to week)

In this review, I have searched some other questions of similar titles and compared the questions and solutions.

There is another question with similar requirement to split the monthly values equally to weekly values according to the number of weeks in the resampled month. In that question, the months are represented as the first date of the months and they are in datetime format and used as index in the dataframe while in this question, the months are represented as YYYY-MM which can be of string type.

A big and critical difference is that in that question, the last month period index 2018-05-01 with value 22644 was actually not processed. That is, the month of 2018-05 is not resampled into weeks in May 2018 and the value 22644 has never been processed to split into weekly proportions. The accepted solution using .asfreq() does not show any entry for 2018-05 at all and the other solution using .resample() still keeps one (un-resampled) entry for 2018-05 and the value 22644 is not split into weekly proportions.

However, in our question here, the last month listed in each group still needs to be resampled into weeks and values split equally for the resampled weeks.

Looking at the solution, my new solution makes no call to .resample() nor .asfreq(). It just uses pd.date_range() with freq='W' with the help of pd.offsets.MonthEnd() to generate weekly frequency for a month based on 'YYYY-MM' values. This is what I could not imagine of when I worked on the old solution making use of .resample()

SeaBean
  • 22,547
  • 3
  • 13
  • 25