I have a pandas dataframe df:

    Date                Activity Vector
0   2017-03-01T15:20:00 [0.0366666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
1   2017-03-01T15:25:00 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2   2017-03-01T15:45:00 [0.163333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
3   2017-03-01T15:50:00 [0.316666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
4   2017-03-01T15:55:00 [0.0666666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
5   2017-03-01T16:00:00 [0.123333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
6   2017-03-01T16:05:00 [0.0333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
7   2017-03-01T16:10:00 [0.356666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
8   2017-03-01T16:15:00 [0.476666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
9   2017-03-01T16:20:00 [0.113333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
10  2017-03-01T16:50:00 [0.0733333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...

This data is a time series with some missing values (note, the Date column has type str).

I would like to reindex this dataframe to a regular 5-minute frequency and fill the missing entries with a numpy vector of zeros, np.zeros(15).

I've tried the following:

df = data.clean_df[['Date', 'Activity Vector']]
df['timestamp'] = pd.to_datetime(df['Date'])
# print(df.dtypes)
df = df.set_index('timestamp').resample('300S').ffill()

which gives me the following:

    timestamp           Date                Activity Vector
0   2017-03-01 15:20:00 2017-03-01T15:20:00 [0.0366666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
1   2017-03-01 15:25:00 2017-03-01T15:25:00 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
2   2017-03-01 15:30:00 2017-03-01T15:25:00 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
3   2017-03-01 15:35:00 2017-03-01T15:25:00 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
4   2017-03-01 15:40:00 2017-03-01T15:25:00 [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...
5   2017-03-01 15:45:00 2017-03-01T15:45:00 [0.163333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
6   2017-03-01 15:50:00 2017-03-01T15:50:00 [0.316666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
7   2017-03-01 15:55:00 2017-03-01T15:55:00 [0.0666666666667, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
8   2017-03-01 16:00:00 2017-03-01T16:00:00 [0.123333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0,...
9   2017-03-01 16:05:00 2017-03-01T16:05:00 [0.0333333333333, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0...
10  2017-03-01 16:10:00 2017-03-01T16:10:00 [0.356666666667, 0.0, 0.0, 0.0, 

However, this fills the missing samples with the previous entry via ffill. How can I instead fill the new rows with custom entries? For example, Date can be anything (it doesn't matter, as it will be dropped later), but Activity Vector should be filled with a numpy vector of zeros, np.zeros(15).

  • You can use fillna: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html – taras Jul 22 '17 at 20:23
  • Please make the sample data smaller. This is very hard to read, and you could make the point just as easily with a tiny dataframe. https://stackoverflow.com/help/mcve and https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – JohnE Jul 22 '17 at 21:07
  • Thanks for the advice, edited out most samples. Sorry about that. – Abdel Wahab Turkmani Jul 23 '17 at 12:47

1 Answer

Since you say the Date value can be anything (it will be dropped later), you can use asfreq instead of ffill. asfreq leaves the newly created rows as NaN, and you can then fill those NaN entries with the list or string you want.

If you want the numpy array as a string, you can use str. If you want it as a list, as in your example, you can parse the string back with ast.literal_eval().

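import ast
import numpy as np
import pandas as pd

df['timestamp'] = pd.to_datetime(df['Date'])
# asfreq() inserts the missing 5-minute slots as NaN rows instead of forward-filling them
df = df.set_index('timestamp').resample('300S').asfreq()
# fillna() does not accept a list as the fill value, so fill with its string form and stringify every row
df['Activity Vector'] = df['Activity Vector'].fillna(str(np.zeros(15).tolist())).apply(str)
# parse each string back into a Python list
df['Activity Vector'] = df['Activity Vector'].apply(ast.literal_eval)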
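
As a possible alternative to the string round trip above, the NaN rows left by asfreq can be replaced directly with zero vectors. This is only a minimal sketch, not part of the original answer: the three-row frame, the vector length n = 3 (the question uses 15), the variable name out, and the '5min' frequency alias (equivalent to '300S') are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's data (3-element vectors instead of 15).
df = pd.DataFrame({
    "Date": ["2017-03-01T15:20:00", "2017-03-01T15:25:00", "2017-03-01T15:45:00"],
    "Activity Vector": [[0.04, 0.0, 0.0], [0.0, 0.0, 0.0], [0.16, 0.0, 0.0]],
})
n = 3  # vector length; the question uses 15

df["timestamp"] = pd.to_datetime(df["Date"])
# asfreq() adds the missing 5-minute slots as rows of NaN.
out = df.set_index("timestamp").resample("5min").asfreq()

# Keep existing vectors; give every NaN row its own fresh zero vector.
out["Activity Vector"] = [
    v if isinstance(v, (list, np.ndarray)) else np.zeros(n)
    for v in out["Activity Vector"]
]
```

Because the existing entries are never converted to strings, this should also work when the column already holds numpy arrays rather than lists, and each filled row gets a separate array rather than a shared one.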

Hope this helps.

Bharath M Shetty