0

everyone:

Here is my question related to the pandas package how to fill in the rows which are missing in the sequencial calendar.

Background:

the table is a sample of my dataset with sale record. As you know, some of products are poor sales. Therefore, we can find some records are absent for "Category-A & product-seed" in 201003 -201005. Hence, it is hard for me to calculate the "sequential growth rate%" for each group in catagory-product.

Initially, I would like to use "groupby+apply" to dig out which periods are lost for each group, then I can recover and "pct_change" them. While it don't work. I don't know where is the root cause...

If you know how to do that, could you share with us your opinion? Appreciate!

Dataset: enter image description here

Calendar:

enter image description here

Result: enter image description here

Add some information:

my calendar is a srting that is consist of "month/quarter/semi-annua/annuall" insteal of date-time format. For example, 2010Q1,or 2019H1. So I hope there is a way to fill in the missing rows by my specific calendar.

In otherwords, I would like to do the first step is to calculate which rows are absent between my specific calendar. And then seconde step is that python can inser the missing rows with catagory and product infomation. Thanks.

MaxxiHuang
  • 37
  • 4
  • Does this answer your question? [Add missing dates to pandas dataframe](https://stackoverflow.com/questions/19324453/add-missing-dates-to-pandas-dataframe) – RichieV Jul 26 '20 at 03:31
  • If there are not inmensely many products and all dates appear at least once, you could `df.set_index(['month', 'category', 'product']).unstack('categroy').unstack('product')` – RichieV Jul 26 '20 at 03:35
  • I think it's not suit to my dataset by the given post, because my calendar is a srting that is consist of "month/quarter/semi-annua/annuall" insteal of date-time format. For example, 2010Q1,or 2019H1. So I hope there is a way to fill in the missing rows by my specific calendar. In otherwords, I would like to do the first step is to calculate which rows are absent between my specific calendar. And then seconde step is that python can inser the missing rows with catagory and product infomation. Thanks. – MaxxiHuang Jul 26 '20 at 12:27
  • `df.reindex` works on any index, not just dates, so if you `df.set_index(['Month', 'Category', 'Product'])` then build all possible combinations with `newidx = pd.MultiIndex.from_product([all_months, all_cats, all_products])` and `df.reindex(newidx)` you should be set – RichieV Jul 26 '20 at 13:00

1 Answers1

1

So depending on what you have on your data this can achieved efficiently in several ways. I will point out two.

First the data:

df = pd.DataFrame(
    {'Month': [201001, 201002, 201006, 201007, 201008, 201001, 201002, 201007, 201008],
    'Category': ['A'] * 9,
    'Product': ['seed'] * 5 + ['flower'] * 4,
    'Sales': [200, 332, 799, 122, 994, 799, 122, 994, 100]}
    ).set_index(['Month', 'Category', 'Product'])

Reshape df

This will work only if ALL POSSIBLE DATES appear at least once in the df.

df = df.unstack(['Category', 'Product']).fillna(0).stack(['Category', 'Product'])
print(df.reset_index())

Output

    Month Category Product  Sales
0  201001        A  flower  799.0
1  201001        A    seed  200.0
2  201002        A  flower  122.0
3  201002        A    seed  332.0
4  201006        A  flower    0.0
5  201006        A    seed  799.0
6  201007        A  flower  994.0
7  201007        A    seed  122.0
8  201008        A  flower  100.0
9  201008        A    seed  994.0

As you can see, this sample data does not include months 3-5

Reindex

If we build a new Index with all possible combinations of date/product pandas will add the missing rows with df.reindex()

months = np.arange(201001, 201008, dtype=np.int)
cats = ['A']
products =['seed', 'flower']
df = df.reindex(
    index=pd.MultiIndex.from_product(
        [months, cats, products],
        names=df.index.names),
    fill_value=0)

print(df.reset_index())

Output

     Month Category Product  Sales
0   201001        A    seed    200
1   201001        A  flower    799
2   201002        A    seed    332
3   201002        A  flower    122
4   201003        A    seed      0
5   201003        A  flower      0
6   201004        A    seed      0
7   201004        A  flower      0
8   201005        A    seed      0
9   201005        A  flower      0
10  201006        A    seed    799
11  201006        A  flower      0
12  201007        A    seed    122
13  201007        A  flower    994
RichieV
  • 5,103
  • 2
  • 11
  • 24
  • For the purpose of your project, I believe the best option would be unstacking and NOT restacking then reindex with all dates... now you have a column for each product and can compute growth rates (or any other stat) by column – RichieV Jul 26 '20 at 13:30
  • You are genius. Appreciate your help! – MaxxiHuang Jul 27 '20 at 14:13