
Suppose that I have a dataframe which can be created using the code below:

import pandas as pd

df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-05'],
                          'product':['A', 'A', 'A', 'B', 'B', 'B'],
                          'price':[10, 20, 30, 40, 50, 60]
                         }
                 )
df['date'] = pd.to_datetime(df['date'])

I want to create a dataframe, let's say main_df, which will contain all dates between df.date.min() and df.date.max() for each product; on days where the value is NaN I want to ffill, and bfill for the remaining ones. The resulting dataframe would be as below:

+------------+---------+-------+
|    date    | product | price |
+------------+---------+-------+
| 2021-01-01 | A       |    10 |
| 2021-01-02 | A       |    20 |
| 2021-01-03 | A       |    20 |
| 2021-01-04 | A       |    20 |
| 2021-01-05 | A       |    30 |
| 2021-01-01 | B       |    40 |
| 2021-01-02 | B       |    40 |
| 2021-01-03 | B       |    50 |
| 2021-01-04 | B       |    50 |
| 2021-01-05 | B       |    60 |
+------------+---------+-------+
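
For illustration, a minimal sketch of one way such a main_df could be built, assuming a scaffold from pd.MultiIndex.from_product plus a left merge (the helper names dates and products are just for this sketch):

# scaffold: every (date, product) combination between the global min and max date
dates = pd.date_range(df['date'].min(), df['date'].max(), freq='D')
products = df['product'].unique()
main_df = (pd.MultiIndex.from_product([dates, products],
                                      names=['date', 'product'])
             .to_frame(index=False))

# attach the known prices, then fill gaps within each product
main_df = main_df.merge(df, on=['date', 'product'], how='left')
main_df = main_df.sort_values(['product', 'date']).reset_index(drop=True)
main_df['price'] = main_df.groupby('product')['price'].ffill()
main_df['price'] = main_df.groupby('product')['price'].bfill()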
    You would like to only keep dates that are between the max and min dates of each group, so how come they are still included in your desired output? – Anoushiravan R Nov 19 '22 at 11:14
  • It is your assumption that I want to keep `min` and `max` of each group. As stated in the OP, `df.date.min()` is the global `min`; similarly, `df.date.max()` is the global `max`. – Lopez Nov 19 '22 at 11:19
  • Does `df.date.min()` compute the `min` date for each group? – Lopez Nov 19 '22 at 11:21
  • "Between" means boundary values inclusive. NOT "Exclusive". The resulting output is shown to be clear on what is required not to argue. – Lopez Nov 19 '22 at 11:27
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/249741/discussion-between-lopez-and-anoushiravan-r). – Lopez Nov 19 '22 at 11:28

2 Answers


First

Make a pivot table, upsample with `asfreq`, and fill the nulls:

df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()

output:

product     A       B
date        
2021-01-01  10.0    40.0
2021-01-02  20.0    40.0
2021-01-03  20.0    50.0
2021-01-04  20.0    50.0
2021-01-05  30.0    60.0



Second

Stack the result, rename the column, and sort (full code included):

(df.pivot_table('price', 'date', 'product').asfreq('D').ffill().bfill()
 .stack().reset_index().rename(columns={0:'price'})
 .sort_values('product').reset_index(drop=True))

output:

    date        product price
0   2021-01-01  A       10.0
1   2021-01-02  A       20.0
2   2021-01-03  A       20.0
3   2021-01-04  A       20.0
4   2021-01-05  A       30.0
5   2021-01-01  B       40.0
6   2021-01-02  B       40.0
7   2021-01-03  B       50.0
8   2021-01-04  B       50.0
9   2021-01-05  B       60.0
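
Pivoting introduces NaN, so `price` comes back as float (`10.0`, `20.0`, ...). If integer prices are needed, a small follow-up cast works once every gap is filled (here `out` is a hypothetical name for the long-format result above):

# `out` holds the long-format result from the chain above
# cast back to int now that no NaN remains
out['price'] = out['price'].astype(int)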
Panda Kim

Using `resample`:

df = pd.DataFrame(data = {'date':['2021-01-01', '2021-01-02', '2021-01-05','2021-01-02', '2021-01-03', '2021-01-06'],
                                'product':['A', 'A', 'A', 'B', 'B', 'B'],
                                'price':[10, 20, 30, 40, 50, 60]
                               }
                       )
df['date'] = pd.to_datetime(df['date'])

df
# Out: 
#          date product  price
# 0  2021-01-01       A     10
# 1  2021-01-02       A     20
# 2  2021-01-05       A     30
# 3  2021-01-02       B     40
# 4  2021-01-03       B     50
# 5  2021-01-06       B     60




df.set_index("date").groupby("product")["price"].resample("d").ffill().reset_index()
# Out: 
#   product       date  price
# 0       A 2021-01-01     10
# 1       A 2021-01-02     20
# 2       A 2021-01-03     20
# 3       A 2021-01-04     20
# 4       A 2021-01-05     30
# 5       B 2021-01-02     40
# 6       B 2021-01-03     50
# 7       B 2021-01-04     50
# 8       B 2021-01-05     50
# 9       B 2021-01-06     60

See the rows that have been filled by `ffill`:

df.set_index("date").groupby("product")["price"].resample("d").mean()
# Out: 
# product  date      
# A        2021-01-01    10.0
#          2021-01-02    20.0
#          2021-01-03     NaN
#          2021-01-04     NaN
#          2021-01-05    30.0
# B        2021-01-02    40.0
#          2021-01-03    50.0
#          2021-01-04     NaN
#          2021-01-05     NaN
#          2021-01-06    60.0
# Name: price, dtype: float64

Note that by grouping by product before resampling and filling the empty slots, each product gets its own range (from its own min to its own max date); I modified the data to showcase this.
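
If, as in the question, every product should instead span the global min/max range (with `bfill` covering leading gaps), here is a sketch of one possible adaptation, assuming a shared `pd.date_range` to reindex against (`full_range` and `out` are illustrative names):

# global daily range shared by all products
full_range = pd.date_range(df['date'].min(), df['date'].max(),
                           freq='D', name='date')

out = (df.set_index('date')
         .groupby('product')['price']
         .apply(lambda s: s.reindex(full_range).ffill().bfill())
         .rename('price')
         .reset_index())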

user2314737