0

I have a data set that has has the sales amount and profit of a given company in a given month. In some months, a company won't conduct any business, and thus it will be missing from the data for that given month. I would like to create rows for these missing months and fill the amount/profit variables with 0.

Code to Reproduce the Data:

import pandas as pd
import numpy as np
import itertools
from datetime import datetime

np.random.seed(2021)

company_list = ['Company_A', 'Company_B', 'Company_C']
datelist = pd.date_range(start="2020-01-01", end='2020-08-01', freq='MS').to_list()

df = pd.DataFrame(list(itertools.product(company_list, datelist)))
df.columns = ['Company', 'Date']

df['Amount'] = np.random.choice(range(0,10000), df.shape[0])
df['Profit'] = np.random.choice(range(0,10000), df.shape[0])

df['Date'] = df['Date'] + pd.tseries.offsets.MonthEnd(1)

df = df.drop([0,1,5,10,11,15,16,17,22,23]).reset_index(drop=True) #Dropping some rows just to illustrate what my data looks like

df

enter image description here

I've figure out how to do this using the help I found in this post, however it uses the min/max ranges of all dates in the 'Date' column. Because the observations for my companies begin and end at different times, this adds dates from before I first observe them and after they disappear from the data, which is not what I want.

df.set_index(
    ['Date', 'Company']
).unstack(
    fill_value = 0
).asfreq(
    'M'
).stack().sort_index(level=1).reset_index()

enter image description here

What I want instead is to only add the missing dates that would result from the min/max dates of the specific group, i.e. just the dates missing in between, not on the ends.

How the Data should look:

enter image description here

  • How about you just delete the row with `0` value or the date not in your original dataframe at end after you fill it uses the min/max ranges like you currently using? – Mr. For Example Feb 03 '21 at 02:20

1 Answers1

1

This will do the trick:

df.drop_duplicates(
  ['Date', 'Company'], 'last'
).groupby('Company').apply(
  lambda x: x.set_index('Date').asfreq('M', fill_value = 0)
).drop('Company', axis = 1).reset_index()