2

I have a dataframe that looks like this:

    province    date        D I R C
360 Incheon     2020-01-20  0 0 1 1
455 Seoul       2020-01-23  0 0 1 1
183 Gyeonggi-do 2020-01-26  0 0 1 1
184 Gyeonggi-do 2020-01-27  0 0 2 2
456 Seoul       2020-01-30  0 0 4 4

And I need that, for every subsequent day after the first one (2020-01-20 in this case), I have the "last seen row" for each province, in case there are no updates in that date. So in the case of my example, I would want as a resulting df:

    province    date        D I R C
1   Incheon     2020-01-20  0 0 1 1

2   Incheon     2020-01-23  0 0 1 1
3   Seoul       2020-01-23  0 0 1 1

4   Gyeonggi-do 2020-01-26  0 0 1 1
5   Incheon     2020-01-26  0 0 1 1
6   Seoul       2020-01-26  0 0 1 1

7   Gyeonggi-do 2020-01-27  0 0 2 2
8   Incheon     2020-01-27  0 0 1 1
9   Seoul       2020-01-27  0 0 1 1

10  Seoul       2020-01-30  0 0 4 4
11  Gyeonggi-do 2020-01-30  0 0 2 2
12  Incheon     2020-01-30  0 0 1 1

Have tried what can be seen here, here, and here, but it is slightly different. My goal is to make a bar plot with time using plotly, which I can right now, but for every date that a province does not have any updates, i.e., no data, the bar disappears, and I need it to stay.

I appreciate any help.

1 Answers1

3

You can do it with set_index and reindex with a MultiiIndex.from_product from all combinations of unique province and date, then groupby the province and ffill, dropna to remove the dates before the first occurrence of each province, sort_index per date (if necessary) and reset_index like:

(df.set_index(['province', 'date'])
   .reindex(pd.MultiIndex.from_product([df['province'].unique(), df['date'].unique()], 
                                       names=['province', 'date']))
   .groupby(level='province').ffill()
   .dropna()
   .sort_index(level='date')
   .reset_index()
)
       province        date    D    I    R    C
0       Incheon  2020-01-20  0.0  0.0  1.0  1.0
1       Incheon  2020-01-23  0.0  0.0  1.0  1.0
2         Seoul  2020-01-23  0.0  0.0  1.0  1.0
3   Gyeonggi-do  2020-01-26  0.0  0.0  1.0  1.0
4       Incheon  2020-01-26  0.0  0.0  1.0  1.0
5         Seoul  2020-01-26  0.0  0.0  1.0  1.0
6   Gyeonggi-do  2020-01-27  0.0  0.0  2.0  2.0
7       Incheon  2020-01-27  0.0  0.0  1.0  1.0
8         Seoul  2020-01-27  0.0  0.0  1.0  1.0
9   Gyeonggi-do  2020-01-30  0.0  0.0  2.0  2.0
10      Incheon  2020-01-30  0.0  0.0  1.0  1.0
11        Seoul  2020-01-30  0.0  0.0  4.0  4.0
Ben.T
  • 29,160
  • 6
  • 32
  • 54
  • Unfortunately, it didn't work, I get a warning "cannot insert province, already exists". Even if I drop the attribute `province` before resetting the index in the last step (fixing the warning), the resulting df here is not the expected, with wrong values in many places – Pedro Germani May 13 '20 at 17:40