9

Imagine I have a dataframe that looks like:

ID      DATE         VALUE
1    31-01-2006        5
1    28-02-2006        5
1    31-05-2006        10
1    30-06-2006        11
2    31-01-2006        5
2    31-02-2006        5
2    31-03-2006        5
2    31-04-2006        5

As you can see this is panel data with multiple entries on the same date for different IDs. What I want to do is fill in missing dates for each ID. You can see that for ID "1" there is a jump in months between the second and third entry.

I would like a dataframe that looks like:

ID      DATE         VALUE
1    31-01-2006        5
1    28-02-2006        5
1    31-03-2006        NA
1    30-04-2006        NA
1    31-05-2006        10
1    30-06-2006        11
2    31-01-2006        5
2    31-02-2006        5
2    31-03-2006        5
2    31-04-2006        5

I have no idea how to do this since I can not index by date since there are duplicate dates.

Chris
  • 433
  • 4
  • 17

2 Answers2

8

One way is to use pivot_table and then unstack:

In [11]: df.pivot_table("VALUE", "DATE", "ID")
Out[11]:
ID             1    2
DATE
28-02-2006   5.0  NaN
30-06-2006  11.0  NaN
31-01-2006   5.0  5.0
31-02-2006   NaN  5.0
31-03-2006   NaN  5.0
31-04-2006   NaN  5.0
31-05-2006  10.0  NaN

In [12]: df.pivot_table("VALUE", "DATE", "ID").unstack().reset_index()
Out[12]:
    ID        DATE     0
0    1  28-02-2006   5.0
1    1  30-06-2006  11.0
2    1  31-01-2006   5.0
3    1  31-02-2006   NaN
4    1  31-03-2006   NaN
5    1  31-04-2006   NaN
6    1  31-05-2006  10.0
7    2  28-02-2006   NaN
8    2  30-06-2006   NaN
9    2  31-01-2006   5.0
10   2  31-02-2006   5.0
11   2  31-03-2006   5.0
12   2  31-04-2006   5.0
13   2  31-05-2006   NaN

An alternative, perhaps slightly more efficient way is to reindex from_product:

In [21] df1 = df.set_index(['ID', 'DATE'])

In [22]: df1.reindex(pd.MultiIndex.from_product(df1.index.levels))
Out[22]:
              VALUE
1 28-02-2006    5.0
  30-06-2006   11.0
  31-01-2006    5.0
  31-02-2006    NaN
  31-03-2006    NaN
  31-04-2006    NaN
  31-05-2006   10.0
2 28-02-2006    NaN
  30-06-2006    NaN
  31-01-2006    5.0
  31-02-2006    5.0
  31-03-2006    5.0
  31-04-2006    5.0
  31-05-2006    NaN
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • @Chris right, it's going to be size N*M always. To get rid of the outside NaN you can do something like `df1 = df.pivot_table("VALUE", "DATE", "ID").unstack().reset_index(name='VALUE') df1[df1.groupby('ID').VALUE.ffill().notnull() & df1.groupby('ID').VALUE.bfill().notnull()]` . – Andy Hayden Feb 01 '19 at 18:34
  • @Chris you can pass a list to `df.pivot_table`, i.e. `df1 = df.pivot_table(["VALUE"], "ID", "DATE").stack(1, dropna=False); df1[df1.groupby(level='ID').ffill()[df1.columns].notnull().any(1) & df1.groupby(level='ID').bfill()[df1.columns].notnull().any(1)]` – Andy Hayden Feb 01 '19 at 20:12
  • @Chris yes, e.g. ["VALUE1", "VALUE2"] – Andy Hayden Feb 01 '19 at 20:36
  • @Chris Are you suggesting that doesn't work? (I have tested it, it does.) Why the unaccept? I think you should ask this as a new question if there is more columns (and my answer doesn't work). I meant `df1 = df.pivot_table(["VALUE1", "VALUE2"], "ID", "DATE").stack(1, dropna=False)`, df1.columns in my code above is correct. – Andy Hayden Feb 01 '19 at 22:14
1

Another solution is to convert the incomplete data to a "wide" form (a table; this will create cells for the missing values) and then back to a "tall" form.

df.set_index(['ID','DATE']).unstack().stack(dropna=False).reset_index()
#    ID        DATE  VALUE
#0    1  28-02-2006    5.0
#1    1  30-06-2006   11.0
#2    1  31-01-2006    5.0
#3    1  31-02-2006    NaN
#4    1  31-03-2006    NaN
#5    1  31-04-2006    NaN
#6    1  31-05-2006   10.0
#7    2  28-02-2006    NaN
#....
DYZ
  • 55,249
  • 10
  • 64
  • 93