I have a Pandas dataframe with the following columns

date | months | price

I am calculating some basic BI metrics. I computed the Net Revenue by grouping the dataframe by date and summing the price:

df = df[["date", "price"]].groupby("date")["price"].sum().reset_index()

Now I want to find the MRR (monthly recurring revenue), which is similar to the Net Revenue, except that when the months column is greater than 1, the price should be spread equally over that many months, starting with the month of the date. It is also grouped by month rather than by day.

For example, if a row dated January 2016 has 3 months and a price of $30, I should add $10 to January, $10 to February and $10 to March.

My first idea was to iterate through the dataframe, keep track of the months and the amount of price I should "move" on next months and create a new dataframe manually.

But, first, is there any Pythonic way in Pandas to do it?

Data to reproduce a dataframe:

import pandas as pd
df = pd.DataFrame({'date': ['01-01-2016', '05-01-2016', '10-01-2016','04-02-2016'], 
                   'months': [1, 3, 1, 6],
                   'price': [40, 60, 20, 60]})

Desired result:

Date         | MRR
January 2016 | 80
February 2016| 30
March 2016   | 30
April 2016   | 10
May 2016     | 10
June 2016    | 10
July 2016    | 10

The results are calculated like this, with one term per row:

January 2016 = 40 + 20 + 20 + 0
February 2016 = 0 + 20 + 0 + 10
March 2016 = 0 + 20 + 0 + 10
April 2016 = 0 + 0 + 0 + 10
May 2016 = 0 + 0 + 0 + 10
June 2016 = 0 + 0 + 0 + 10
July 2016 = 0 + 0 + 0 + 10
Tasos

1 Answer


I don't know any way around using a loop. However, I can suggest a way to make the code pretty clean and efficient.

First, let's load the example data you supplied in the question text:

df = pd.DataFrame({'date': ['01-01-2016', '05-01-2016', '10-01-2016','04-02-2016'], 
                   'months': [1, 3, 1, 6],
                   'price': [40, 60, 20, 60]})

To use pandas' date functionality (e.g. grouping by month), we will parse the date column and use it as the index, which makes the index a DatetimeIndex:

df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')
df = df.set_index('date')
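The explicit format matters here because the dates are written day-first. As a small sketch (the variable names are only for illustration), compare parsing one of the example strings with and without the format:

```python
import pandas as pd

raw = "05-01-2016"  # meant as 5 January 2016 in the example data

# With the explicit day-month-year format
explicit = pd.to_datetime(raw, format="%d-%m-%Y")

# Without a format, an ambiguous string like this is read month-first by default
default = pd.to_datetime(raw)

print(explicit.date())
print(default.date())
```

Without the format, this row would land in May instead of January, which would silently shift the monthly totals.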

Now it's easy to view, for example, a month-by-month summary with the resample function. It works like the groupby function you already know, but groups by time periods:

df.resample('M').sum()
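To make the groupby analogy concrete, here is a sketch that groups the same example data by calendar month using monthly periods. It rebuilds the example frame so it can run on its own, and it should give the same price totals per month as the resample call. Note that, as far as I know, recent pandas releases (2.2 and later) prefer the alias 'ME' over 'M' for month-end resampling, so the spelling of the resample call may need adjusting for your version; the period-based grouping below avoids that alias altogether.

```python
import pandas as pd

# Example data from the question, with the parsed dates as the index
df = pd.DataFrame(
    {"months": [1, 3, 1, 6], "price": [40, 60, 20, 60]},
    index=pd.to_datetime(
        ["01-01-2016", "05-01-2016", "10-01-2016", "04-02-2016"],
        format="%d-%m-%Y",
    ),
)

# Convert every timestamp to its calendar month and group on that
monthly = df["price"].groupby(df.index.to_period("M")).sum()
print(monthly)
```

This is the plain Net Revenue per month, before any price is spread over later months.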

Next, we need to "spread out" the rows where the months column is > 1 over multiple months. My approach here is to generate a new DataFrame for each row:

dfs = []
for date, row in df.iterrows():
    # iterrows() may upcast a row to float when dtypes are mixed, so cast back to int
    months = int(row['months'])
    price = row['price']
    dfs.append(
        pd.DataFrame(
            # Compute the price for each month, and repeat this value
            data={'price': [price / months] * months},
            # The index is a range of month-end dates for the requested number of months
            index=pd.date_range(date, periods=months, freq='M')
        )
    )

Now we can just concatenate the list of DataFrames, resample to months and take the sum:

output = pd.concat(dfs).resample('M').sum()
print(output)

Output:

            price
2016-01-31     80
2016-02-29     30
2016-03-31     30
2016-04-30     10
2016-05-31     10
2016-06-30     10
2016-07-31     10

See http://pandas.pydata.org/pandas-docs/stable/timeseries.html for all the cool things pandas can do with time series. For example, to produce exactly your desired output you could do this:

output.index = output.index.strftime('%B %Y')

Which results in this:

               price
January 2016      80
February 2016     30
March 2016        30
April 2016        10
May 2016          10
June 2016         10
July 2016         10
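For what it's worth, a loop-free variant also seems possible. The following is only a sketch under the assumption that months always holds positive integers: it repeats each row once per covered month with `Index.repeat`, numbers the copies with `cumcount`, and shifts a monthly period by that number. The names `spread`, `offset`, `period` and `mrr` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'date': ['01-01-2016', '05-01-2016', '10-01-2016', '04-02-2016'],
                   'months': [1, 3, 1, 6],
                   'price': [40, 60, 20, 60]})
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y')

# Repeat each row once per month it covers (original index labels are kept)
spread = df.loc[df.index.repeat(df['months'])].copy()

# 0, 1, 2, ... within each original row: number of months after the start month
offset = spread.groupby(level=0).cumcount()

# Monthly period of the start date, shifted forward by the offset
spread['period'] = spread['date'].dt.to_period('M') + offset.to_numpy()

# Every covered month receives an equal share of the price
spread['mrr'] = spread['price'] / spread['months']

mrr = spread.groupby('period')['mrr'].sum()
print(mrr)
```

The result is indexed by monthly periods rather than month-end dates; `mrr.index.strftime('%B %Y')` should give the same labels as above.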
Marijn van Vliet
  • Tried a couple of times, but it always returns a single value and not a dataframe with the months and the sum of prices. – Tasos Oct 14 '16 at 08:35