1

I am trying to add missing dates to my dataframe.

I have seen this posts: reindex and reindex2.

When I try to reindex my dataframe:

print(df)
df = df.reindex(dates, fill_value=0)
print(df)

I get the following output:

_updated_at         Name        hour day date       time      data1     data2
06/06/2016 13:27    game_name   13  6    06/06/2016 evening   0         0
07/06/2016 10:33    game_name   10  7    07/06/2016 morning   145.2788  122.7361
18/10/2016 14:34    game_name   14  18   18/10/2016 evening   0         0
19/10/2016 17:12    game_name   17  19   19/10/2016 evening   0         0
24/10/2016 11:05    game_name   11  24   24/10/2016 morning   313.5954  364.4107
24/10/2016 12:02    game_name   12  24   24/10/2016 evening   0         0
25/10/2016 08:50    game_name   8   25   25/10/2016 morning   362.4682  431.5803
25/10/2016 13:00    game_name   13  25   25/10/2016 evening   0         0


_updated_at Name hour day date  time data1  data2
24/10/2016  0    0    0   0     0    0      0
25/10/2016  0    0    0   0     0    0      0
26/10/2016  0    0    0   0     0    0      0
27/10/2016  0    0    0   0     0    0      0
28/10/2016  0    0    0   0     0    0      0
29/10/2016  0    0    0   0     0    0      0
30/10/2016  0    0    0   0     0    0      0

I am expecting to see the rows where a date is missing filled with the new row and 0's in each value, rather than all rows replaced with 0.

EDIT: The overall goal is to be able to calculate the difference between values resulting in a morning and evening diff on a per day basis.

EDIT2: Current output:

print (df.reindex(mux, fill_value=0).groupby(level=0)['data1'].diff(-1).dropna())

dtypes: float64(2)None
2016-06-06  morning       0.00000
2016-06-07  morning     440.99582
2016-06-08  morning       0.00000
2016-06-09  morning       0.00000
2016-06-10  morning       0.00000

print (df.reindex(mux, fill_value=0).groupby(level=0)['data2'].diff(-1).dropna())

Length: 142, dtype: float64
2016-06-06  morning    -220.5481
2016-06-07  morning       0.0000
2016-06-08  morning       0.0000
2016-06-09  morning       0.0000
2016-06-10  morning       0.0000
2016-06-11  morning       0.0000

I was expecting to see evening values

Community
  • 1
  • 1
archienorman
  • 1,434
  • 3
  • 20
  • 36

1 Answers1

1

You can reindex by MultiIndex.from_product from columns dates and time:

df.date = pd.to_datetime(df.date)
dates = pd.date_range(start=df.date.min(), end=df.date.max())
print (dates)
DatetimeIndex(['2016-06-06', '2016-06-07', '2016-06-08', '2016-06-09',
               '2016-06-10', '2016-06-11', '2016-06-12', '2016-06-13',
               '2016-06-14', '2016-06-15',
               ...
               '2016-10-16', '2016-10-17', '2016-10-18', '2016-10-19',
               '2016-10-20', '2016-10-21', '2016-10-22', '2016-10-23',
               '2016-10-24', '2016-10-25'],
              dtype='datetime64[ns]', length=142, freq='D')

mux = pd.MultiIndex.from_product([dates,['morning','evening']])
#print (mux)

df.set_index(['date','time'], inplace=True)

print (df.reindex(mux, fill_value=0))
                         _updated_at       Name  hour  day     data1     data2
2016-06-06 morning                 0          0     0    0    0.0000    0.0000
           evening  06/06/2016 13:27  game_name    13    6    0.0000    0.0000
2016-06-07 morning                 0          0     0    0    0.0000    0.0000
           evening                 0          0     0    0    0.0000    0.0000
2016-06-08 morning                 0          0     0    0    0.0000    0.0000
           evening                 0          0     0    0    0.0000    0.0000
2016-06-09 morning                 0          0     0    0    0.0000    0.0000
           evening                 0          0     0    0    0.0000    0.0000
2016-06-10 morning                 0          0     0    0    0.0000    0.0000
           evening                 0          0     0    0    0.0000    0.0000
2016-06-11 morning                 0          0     0    0    0.0000    0.0000
           evening                 0          0     0    0    0.0000    0.0000
2016-06-12 morning                 0          0     0    0    0.0000    0.0000
           evening                 0          0     0    0    0.0000    0.0000
2016-06-13 morning                 0          0     0    0    0.0000    0.0000
...

Last you can groupby by first level of Multiindex (dates) with DataFrameGroupBy.diff. You get for each dates row with NaN which can be removed by dropna:

print (df.reindex(mux, fill_value=0).groupby(level=0)['data1','data2'].diff(-1).dropna())
                       data1     data2
2016-06-06 morning    0.0000    0.0000
2016-06-07 morning    0.0000    0.0000
2016-06-08 morning    0.0000    0.0000
2016-06-09 morning    0.0000    0.0000
2016-06-10 morning    0.0000    0.0000
2016-06-11 morning    0.0000    0.0000
2016-06-12 morning    0.0000    0.0000
2016-06-13 morning    0.0000    0.0000
2016-06-14 morning    0.0000    0.0000
2016-06-15 morning    0.0000    0.0000
2016-06-16 morning    0.0000    0.0000
2016-06-17 morning    0.0000    0.0000
2016-06-18 morning    0.0000    0.0000
2016-06-19 morning    0.0000    0.0000
2016-06-20 morning    0.0000    0.0000
2016-06-21 morning    0.0000    0.0000
...
...

You can also select by ix and subtract:

print (df.reindex(mux, fill_value=0)
         .groupby(level=0)
         .apply(lambda x: x.ix[0, ['data1','data2']]-x.ix[1, ['data1','data2']]))

               data1     data2
2016-06-06    0.0000    0.0000
2016-06-07    0.0000    0.0000
2016-06-08    0.0000    0.0000
2016-06-09    0.0000    0.0000
2016-06-10    0.0000    0.0000
2016-06-11    0.0000    0.0000
2016-06-12    0.0000    0.0000
2016-06-13    0.0000    0.0000
2016-06-14    0.0000    0.0000
2016-06-15    0.0000    0.0000
2016-06-16    0.0000    0.0000
2016-06-17    0.0000    0.0000
2016-06-18    0.0000    0.0000
2016-06-19    0.0000    0.0000
2016-06-20    0.0000    0.0000
2016-06-21    0.0000    0.0000
2016-06-22    0.0000    0.0000
2016-06-23    0.0000    0.0000
2016-06-24    0.0000    0.0000
2016-06-25    0.0000    0.0000
2016-06-26    0.0000    0.0000
2016-06-27    0.0000    0.0000
2016-06-28    0.0000    0.0000
2016-06-29    0.0000    0.0000
2016-06-30    0.0000    0.0000
2016-07-01    0.0000    0.0000
2016-07-02    0.0000    0.0000
2016-07-03    0.0000    0.0000
2016-07-04    0.0000    0.0000
2016-07-05    0.0000    0.0000
             ...       ...
2016-09-26    0.0000    0.0000
2016-09-27    0.0000    0.0000
2016-09-28    0.0000    0.0000
2016-09-29    0.0000    0.0000
2016-09-30    0.0000    0.0000
2016-10-01    0.0000    0.0000
2016-10-02    0.0000    0.0000
2016-10-03    0.0000    0.0000
2016-10-04    0.0000    0.0000
2016-10-05    0.0000    0.0000
2016-10-06    0.0000    0.0000
2016-10-07    0.0000    0.0000
2016-10-08    0.0000    0.0000
2016-10-09    0.0000    0.0000
2016-10-10    0.0000    0.0000
2016-10-11    0.0000    0.0000
2016-10-12    0.0000    0.0000
2016-10-13    0.0000    0.0000
2016-10-14    0.0000    0.0000
2016-10-15    0.0000    0.0000
2016-10-16    0.0000    0.0000
2016-10-17    0.0000    0.0000
2016-10-18    0.0000    0.0000
2016-10-19    0.0000    0.0000
2016-10-20    0.0000    0.0000
2016-10-21    0.0000    0.0000
2016-10-22    0.0000    0.0000
2016-10-23    0.0000    0.0000
2016-10-24  313.5954  364.4107
2016-10-25  362.4682  431.5803

[142 rows x 2 columns]
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • is there no way of ignoring duplicates when using reindex? – archienorman Oct 26 '16 at 12:12
  • unfortunately i dont think the mean will work, you will see there is a column called 'time' which specifies if its in the morning or evening. i need to keep these rows separate for intraday calculations – archienorman Oct 26 '16 at 12:26
  • Yes, this is another problem. It is called [automatic exclusion of nuisance columns](http://pandas.pydata.org/pandas-docs/stable/groupby.html#automatic-exclusion-of-nuisance-columns). So it replace only if numeric data. But question is if for some dates are `morning` and `evening` in `time` column, what is desired output after aggreagtion? Both values, or only first or only last? – jezrael Oct 26 '16 at 12:47
  • the goal is to compute the difference of the values on an intraday basis. (current morning value = current morning value - previous afternoon data) I hope to be left with a morning diff value and evening diff value for each day. Therefore I would only keep the most recent record in the morning and the most recent in the evening. – archienorman Oct 26 '16 at 12:51
  • Hmmm, I have an idea. – jezrael Oct 26 '16 at 12:58
  • This is great! Thank you! One issue, the last line returns this error: `TypeError: unsupported operand type(s) for -: 'int' and 'unicode'` Assume i have mismatched types in my columns? – archienorman Oct 26 '16 at 13:22
  • What is `print (df[['data1','data2']].info()` ? – jezrael Oct 26 '16 at 13:25
  • `8 non-null float64` - I turned the columns into float when i created them (just after receiving that error) – archienorman Oct 26 '16 at 13:27
  • I have no idea. I think you have no float in columns. If remove `dropna()` still error? – jezrael Oct 26 '16 at 13:28
  • removing `dropna()` doesnt help and checked that all values in each column are a float. which is weird – archienorman Oct 26 '16 at 13:34
  • Do you use `python 2` or `3`? – jezrael Oct 26 '16 at 13:40
  • using python 2.7 - just tried: `df['data1'] = df['data1'].astype(float)`, didn't work – archienorman Oct 26 '16 at 13:42
  • I was afraid about it. It seems some bad coding issue. This return errors also? `print (df.reindex(mux, fill_value=0).groupby(level=0)['data1'].diff(-1).dropna())` and `print (df.reindex(mux, fill_value=0).groupby(level=0)['data2'].diff(-1).dropna())` ? – jezrael Oct 26 '16 at 13:45
  • works now: however it shows morning in both cases, rather than one morning and one evening. also I'm not sure this takes into account that morning = (morning - previous evening) - updating question to show the output – archienorman Oct 26 '16 at 13:51
  • Yes, it is desired output from diff - it subtract `morning` data with `evening`. But it also wants subtract `evening` with `non exist`, so get `NaN`. But if need same data for evening and morning use `print (df.reindex(mux, fill_value=0).groupby(level=0)['data1','data2'].diff(-1).ffill())` – jezrael Oct 26 '16 at 13:56
  • afraid i can't get this to work `print (df.reindex(mux, fill_value=0).groupby(level=0)['data1','data2'].diff(-1).ffi‌​ll())` - perhaps best just iterate over and calculate the diffs. – archienorman Oct 26 '16 at 14:06
  • Yes, give me a sec. – jezrael Oct 26 '16 at 14:07
  • I add another solution, please check it. – jezrael Oct 26 '16 at 14:11
  • Thank you so much! – archienorman Oct 26 '16 at 14:37
  • Ya, it was complicated, but I am glad I can help you. But it was not easy. Thanks for accepting and nice day! – jezrael Oct 26 '16 at 14:39