Fill missing composite indexes in pandas dataframe

Question

My raw data looks like this:

raw_data = [
    {
        "date": "02.08.2020",
        "status": "A",
        "id": 3948,
    },
    {
        "date": "02.08.2020",
        "status": "B",
        "id": 495,
    },
    {
        "date": "03.08.2020",
        "status": "A",
        "id": 433,
    },
    {
        "date": "03.08.2020",
        "status": "B",
        "id": 845,
    },
    {
        "date": "03.08.2020",
        "status": "B",
        "id": 54,
    },
    {
        "date": "03.08.2020",
        "status": "C",
        "id": 133,
    },
    {
        "date": "04.08.2020",
        "status": "B",
        "id": 384,
    },
    {
        "date": "04.08.2020",
        "status": "C",
        "id": 1234,
    },
]

Then I create a pandas dataframe from it and group by date and status to get the count per date, per status:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(raw_data)
grp = df.groupby(['date', 'status']).size().to_frame(name = 'count').reset_index()

print(grp)

The output:

0  02.08.2020      A      1
1  02.08.2020      B      1
2  03.08.2020      A      1
3  03.08.2020      B      2
4  03.08.2020      C      1
5  04.08.2020      B      1
6  04.08.2020      C      1

I then chart this using seaborn:

sns.barplot(x="date", y="count", hue="status", data=grp)
plt.show()

And the chart looks like this:

Now that's cool, it visualises the status per day. But what I would like to do is fill the "gaps" between the days, because I might need to actually visualise the data starting from 01.08.2020 to 05.08.2020. I would like my dataframe to look like this:

0   01.08.2020      A      0    <<< added 3 empty rows for 01.08.2020 bc. no data
1   01.08.2020      B      0    <<<
2   01.08.2020      C      0    <<<
3   02.08.2020      A      1
4   02.08.2020      B      1
5   03.08.2020      A      1
6   03.08.2020      B      2
7   03.08.2020      C      1
8   04.08.2020      B      1
9   04.08.2020      C      1
10  05.08.2020      A      0    <<< added 3 empty rows for 05.08.2020 bc. no data
11  05.08.2020      B      0    <<<
12  05.08.2020      C      0    <<<

The idea is to have all days from 01.08 to 05.08 in the charts x-axis, even if there is no data for a given day. The goal is to make the x-axis continuous with all days from 01.08.2020 to 05.08.2020, data or no data.

I had a look at another SO question which is about adding missing days in a datetime index. It's kind of what I want, but I also need to have the status per day empty I guess.

Can I create a dataframe like I want with builtin pandas functionality?

the y axis anyway starts at 0, when you fill the missing dates with 0 what do you expect to plot? — anky, Aug 07 '20 at 19:46
Do you really need to fill in your dataframe, or just change the plot? — Paul H, Aug 08 '20 at 00:08
@PaulH changing the plot is what I want to do. I thought this only works by having the data in the dataframe structured in this way. — Max, Aug 08 '20 at 12:57

score 2 · Answer 1 · answered Aug 07 '20 at 19:57

Globally, you can use reindex at some point, here is one way to create grp with the missing value you want. One could go by using groupby.size as you did as well for similar idea result

grp = (df.pivot_table(index='date', columns='status',
                      aggfunc='size', fill_value=0)
         .reindex(pd.Index(pd.date_range('2020-08-01', '2020-08-05').strftime('%d.%m.%Y'),
                           name='date'), 
                  fill_value=0)
         .stack()
         .reset_index(name='count')
      )
print(grp)
          date status  count
0   01.08.2020      A      0
1   01.08.2020      B      0
2   01.08.2020      C      0
3   02.08.2020      A      1
4   02.08.2020      B      1
5   02.08.2020      C      0
6   03.08.2020      A      1
7   03.08.2020      B      2
8   03.08.2020      C      1
9   04.08.2020      A      0
10  04.08.2020      B      1
11  04.08.2020      C      1
12  05.08.2020      A      0
13  05.08.2020      B      0
14  05.08.2020      C      0

score 2 · Accepted Answer · answered Aug 07 '20 at 20:02

Another way using Multi-index and re-index:

idx = pd.date_range('2020-08-01','2020-08-05').strftime('%d.%m.%Y')
idx1 = pd.MultiIndex.from_product((idx,(df['status'].unique())),names=['date','status'])
grp = (df.groupby(['date', 'status']).size().reindex(idx1,fill_value=0)
         .reset_index(name="count"))

sns.barplot(x="date", y="count", hue="status", data=grp)
plt.show()

Fill missing composite indexes in pandas dataframe

2 Answers2