2

My raw data looks like this:

raw_data = [
    {
        "date": "02.08.2020",
        "status": "A",
        "id": 3948,
    },
    {
        "date": "02.08.2020",
        "status": "B",
        "id": 495,
    },
    {
        "date": "03.08.2020",
        "status": "A",
        "id": 433,
    },
    {
        "date": "03.08.2020",
        "status": "B",
        "id": 845,
    },
    {
        "date": "03.08.2020",
        "status": "B",
        "id": 54,
    },
    {
        "date": "03.08.2020",
        "status": "C",
        "id": 133,
    },
    {
        "date": "04.08.2020",
        "status": "B",
        "id": 384,
    },
    {
        "date": "04.08.2020",
        "status": "C",
        "id": 1234,
    },
]

Then I create a pandas dataframe from it and group by date and status to get the count per date, per status:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame(raw_data)
grp = df.groupby(['date', 'status']).size().to_frame(name = 'count').reset_index()

print(grp)

The output:

0  02.08.2020      A      1
1  02.08.2020      B      1
2  03.08.2020      A      1
3  03.08.2020      B      2
4  03.08.2020      C      1
5  04.08.2020      B      1
6  04.08.2020      C      1

I then chart this using seaborn:

sns.barplot(x="date", y="count", hue="status", data=grp)
plt.show()

And the chart looks like this:

enter image description here

Now that's cool, it visualises the status per day. But what I would like to do is fill the "gaps" between the days, because I might need to actually visualise the data starting from 01.08.2020 to 05.08.2020. I would like my dataframe to look like this:

0   01.08.2020      A      0    <<< added 3 empty rows for 01.08.2020 bc. no data
1   01.08.2020      B      0    <<<
2   01.08.2020      C      0    <<<
3   02.08.2020      A      1
4   02.08.2020      B      1
5   03.08.2020      A      1
6   03.08.2020      B      2
7   03.08.2020      C      1
8   04.08.2020      B      1
9   04.08.2020      C      1
10  05.08.2020      A      0    <<< added 3 empty rows for 05.08.2020 bc. no data
11  05.08.2020      B      0    <<<
12  05.08.2020      C      0    <<<

The idea is to have all days from 01.08 to 05.08 in the charts x-axis, even if there is no data for a given day. The goal is to make the x-axis continuous with all days from 01.08.2020 to 05.08.2020, data or no data.

I had a look at another SO question which is about adding missing days in a datetime index. It's kind of what I want, but I also need to have the status per day empty I guess.

Can I create a dataframe like I want with builtin pandas functionality?

Max
  • 15,693
  • 14
  • 81
  • 131
  • the y axis anyway starts at 0, when you fill the missing dates with 0 what do you expect to plot? – anky Aug 07 '20 at 19:46
  • 1
    I expect to plot all days on the x-axis. – Max Aug 07 '20 at 19:48
  • Do you really need to fill in your dataframe, or just change the plot? – Paul H Aug 08 '20 at 00:08
  • @PaulH changing the plot is what I want to do. I thought this only works by having the data in the dataframe structured in this way. – Max Aug 08 '20 at 12:57

2 Answers2

2

Globally, you can use reindex at some point, here is one way to create grp with the missing value you want. One could go by using groupby.size as you did as well for similar idea result

grp = (df.pivot_table(index='date', columns='status',
                      aggfunc='size', fill_value=0)
         .reindex(pd.Index(pd.date_range('2020-08-01', '2020-08-05').strftime('%d.%m.%Y'),
                           name='date'), 
                  fill_value=0)
         .stack()
         .reset_index(name='count')
      )
print(grp)
          date status  count
0   01.08.2020      A      0
1   01.08.2020      B      0
2   01.08.2020      C      0
3   02.08.2020      A      1
4   02.08.2020      B      1
5   02.08.2020      C      0
6   03.08.2020      A      1
7   03.08.2020      B      2
8   03.08.2020      C      1
9   04.08.2020      A      0
10  04.08.2020      B      1
11  04.08.2020      C      1
12  05.08.2020      A      0
13  05.08.2020      B      0
14  05.08.2020      C      0
Ben.T
  • 29,160
  • 6
  • 32
  • 54
2

Another way using Multi-index and re-index:

idx = pd.date_range('2020-08-01','2020-08-05').strftime('%d.%m.%Y')
idx1 = pd.MultiIndex.from_product((idx,(df['status'].unique())),names=['date','status'])
grp = (df.groupby(['date', 'status']).size().reindex(idx1,fill_value=0)
         .reset_index(name="count"))

sns.barplot(x="date", y="count", hue="status", data=grp)
plt.show()

enter image description here

anky
  • 74,114
  • 11
  • 41
  • 70