
My initial DataFrame is:

df = pd.DataFrame({"a":["2020-01-01", "2020-01-06", "2020-01-04", "2020-01-07"],
                   "b":["a", "a", "b", "b"],
                   "c":[1, 2, 3,4]})
print(df)
            a  b  c
0  2020-01-01  a  1
1  2020-01-06  a  2
2  2020-01-04  b  3
3  2020-01-07  b  4

I want my dataset to look like this:

            a  b  c
0  2020-01-01  a  1
1  2020-01-02  a  NaN
2  2020-01-03  a  NaN
3  2020-01-04  a  NaN
4  2020-01-05  a  NaN
5  2020-01-06  a  2
6  2020-01-04  b  3
7  2020-01-05  b  NaN
8  2020-01-06  b  NaN
9  2020-01-07  b  4

I tried:

df.set_index([df.a, df.b], inplace=True)
df.asfreq("D")

df.set_index([df.a, df.b], inplace=True)
df.resample("D")

but I encountered:

TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'

My real DataFrame's column ('b' column in this example) has many unique values.
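For completeness, asfreq does work if I select a single group and use column a alone as a DatetimeIndex; I just can't do this by hand for every value of b. A minimal reproduction, using the same df as above:

import pandas as pd

df = pd.DataFrame({"a": ["2020-01-01", "2020-01-06", "2020-01-04", "2020-01-07"],
                   "b": ["a", "a", "b", "b"],
                   "c": [1, 2, 3, 4]})
df["a"] = pd.to_datetime(df["a"])

# works: one group, plain DatetimeIndex
print(df[df.b == "a"].set_index("a").asfreq("D"))

# fails: asfreq does not accept a MultiIndex
df.set_index(["a", "b"]).asfreq("D")  # raises the TypeError above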

– 권순규

3 Answers

import pandas as pd

df = pd.DataFrame({"a":["2020-01-01", "2020-01-06", "2020-01-04", "2020-01-07"],
                   "b":["a", "a", "b", "b"],
                   "c":[1, 2, 3, 4]})
# make datetime
df['a'] = pd.to_datetime(df['a'])
# create a groupby object
g = df.groupby('b')
# list comprehension with reindex and date_range, then concat the list of frames;
# each group is reindexed to its own daily date range
df2 = pd.concat([grp.set_index('a').reindex(pd.date_range(grp['a'].min(),
                                                          grp['a'].max(), freq='D'))
                 for _, grp in g])
# ffill column b (reindex left NaN in 'b' for the newly added rows)
df2['b'] = df2['b'].ffill()


            b    c
2020-01-01  a  1.0
2020-01-02  a  NaN
2020-01-03  a  NaN
2020-01-04  a  NaN
2020-01-05  a  NaN
2020-01-06  a  2.0
2020-01-04  b  3.0
2020-01-05  b  NaN
2020-01-06  b  NaN
2020-01-07  b  4.0
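
If you want the dates back as a regular a column, matching the layout in the question, one option (a small follow-up sketch, not part of the original answer) is:

# move the unnamed DatetimeIndex back into a column named 'a'
df2 = df2.rename_axis('a').reset_index()
print(df2)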
– It_is_Chris
  • How would you do that if you want a specific date delta, ie from 2020-01-01 to 2021-04-01 for each value of the column 'b'? I mean same dates (in that range) for a and b. – Roy May 09 '21 at 04:12
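Regarding the comment above: one option (a sketch, not from the original answer, assuming df['a'] is already datetime) is to reindex every group against the same fixed date range instead of its own min/max:

# fixed daily range shared by every group, e.g. 2020-01-01 .. 2021-04-01
full_range = pd.date_range("2020-01-01", "2021-04-01", freq="D")

# assign(b=key) restores the group label on the newly added rows
df2 = pd.concat([grp.set_index("a").reindex(full_range).assign(b=key)
                 for key, grp in df.groupby("b")])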

Another approach with groupby and asfreq:

(df.set_index('a')
   .groupby('b').apply(lambda x: x.drop('b',axis=1).asfreq('D'))
   .reset_index()
)

Output:

   b          a    c
0  a 2020-01-01  1.0
1  a 2020-01-02  NaN
2  a 2020-01-03  NaN
3  a 2020-01-04  NaN
4  a 2020-01-05  NaN
5  a 2020-01-06  2.0
6  b 2020-01-04  3.0
7  b 2020-01-05  NaN
8  b 2020-01-06  NaN
9  b 2020-01-07  4.0
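
A closely related variant (just a sketch, not part of the original answer, and again assuming df['a'] is already datetime) is to let a grouped resample do the upsampling; it yields the same b/a/c layout as above:

out = (df.set_index('a')
         .groupby('b')['c']
         .resample('D')   # daily frequency within each group
         .asfreq()        # upsample, leaving the new rows as NaN
         .reset_index())
print(out)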
– Quang Hoang

We could use the complete function from pyjanitor, which provides a convenient abstraction for generating the missing rows:

# pip install pyjanitor
import pandas as pd
import janitor as jn

df['a'] = pd.to_datetime(df['a'])

# create a mapping for the new dates
dates = {"a" : lambda a : pd.date_range(a.min(), a.max(), freq='1D')}

# create the new dataframe, exposing the missing rows, per group:

df.complete(dates, by='b', sort = True)
 
           a  b    c
0 2020-01-01  a  1.0
1 2020-01-02  a  NaN
2 2020-01-03  a  NaN
3 2020-01-04  a  NaN
4 2020-01-05  a  NaN
5 2020-01-06  a  2.0
6 2020-01-04  b  3.0
7 2020-01-05  b  NaN
8 2020-01-06  b  NaN
9 2020-01-07  b  4.0

An alternative approach, which bypasses the by parameter and can improve performance, is shown below:

# build the mapping for the entire dataframe, ignoring the groups
dates = {"a" : pd.date_range(df.a.min(), df.a.max(), freq='1D')}

# create a groupby object
grouped = df.groupby('b').a

# create temporary columns for the min and max dates
(df.assign(date_min = grouped.transform('min'), 
           date_max = grouped.transform('max'))
    # expose the missing rows based on the combination of dates
    # and the tuple of b, date_min and date_max
   .complete(dates, ('b', 'date_min', 'date_max'))
    # filter for rows, this will keep only the relevant dates
   .loc[lambda df: df.a.between(df.date_min, df.date_max), 
        df.columns]
   .sort_values(['b', 'a'], ignore_index=True)
)

           a  b    c
0 2020-01-01  a  1.0
1 2020-01-02  a  NaN
2 2020-01-03  a  NaN
3 2020-01-04  a  NaN
4 2020-01-05  a  NaN
5 2020-01-06  a  2.0
6 2020-01-04  b  3.0
7 2020-01-05  b  NaN
8 2020-01-06  b  NaN
9 2020-01-07  b  4.0

The performance gain should be more significant for large DataFrames. A similar example with tests is offered here.
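
For comparison, here is a rough plain-pandas sketch of the same idea (build the full b × date grid, left-merge the data, then keep only dates inside each group's own min/max window); this is my approximation of the trick, not the pyjanitor implementation, and full, bounds and out are just illustrative names:

# full grid of every b value with every date in the overall range
full = pd.MultiIndex.from_product(
    [df['b'].unique(), pd.date_range(df['a'].min(), df['a'].max(), freq='D')],
    names=['b', 'a']).to_frame(index=False)

# per-group date bounds
bounds = df.groupby('b')['a'].agg(date_min='min', date_max='max').reset_index()

out = (full.merge(df, on=['b', 'a'], how='left')
           .merge(bounds, on='b')
           .loc[lambda d: d['a'].between(d['date_min'], d['date_max']),
                ['a', 'b', 'c']]
           .sort_values(['b', 'a'], ignore_index=True))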

– sammywemmy