
I am working with data that arrives as a list of dicts, each containing a timestamp (one per minute of the day) and a value:

dataset = [{'time': '00:00:00', 'value': 0}, {'time': '00:01:00', 'value': 0}, ...]

I collect a week's worth of this data, resample each day to 1-hour intervals, and then average each hour across the week.

import pandas as pd

array_ls = []
time_list, value_list = [], []
for dic in dataset:
    time_list.append(dic['time'])
    value_list.append(dic['value'])

df1 = pd.DataFrame({'values': value_list, 'time': time_list})
df1['time'] = pd.to_datetime(df1['time'])
df1 = df1.set_index('time')
df1 = df1.resample('1H').sum()
array_ls.append(df1)
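(As an aside: `pd.DataFrame` accepts a list of dicts directly, so the append loop above can be collapsed — a minimal sketch, assuming hypothetical minute-level data in the shape described at the top:)

```python
import pandas as pd

# Hypothetical day's data in the list-of-dicts shape described above.
dataset = [{'time': '00:00:00', 'value': 3},
           {'time': '00:01:00', 'value': 5},
           {'time': '01:00:00', 'value': 7}]

# pd.DataFrame reads the list of dicts directly; no append loop needed.
df1 = pd.DataFrame(dataset)
df1['time'] = pd.to_datetime(df1['time'])
df1 = df1.set_index('time').resample('1H').sum()
print(df1['value'].tolist())  # [8, 7] — hourly sums
```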

This leaves me with a list of 7 series which I can create a new df from and average

dfnew = pd.DataFrame()
for i in range(len(array_ls)):
    dfnew[header_list[i]] = array_ls[i]
dfnew['means'] = dfnew.mean(axis=1, skipna=True)

All this works perfectly so long as I have full datasets. However, when the first array in the list is truncated (i.e. there is missing data), the other arrays become truncated to the length of the first array when I add them to 'dfnew'.

For example, if after resampling the first array only has values from midnight to 11 am (i.e. 11 values), while the others are full arrays (i.e. 24 values for a full day), the final 'dfnew' data frame will only have 11 values in every column. I think this is because it sets the index from the first column imported and ignores any data with an index outside of that range.

What I would like to do is get the sum value for each hour, replace it with NaN if there are no values for that hour, and then ignore that hour when averaging across the week (so that the average for 11 am might include 7 days' data, while the average for 12 pm might only include 6).
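(For illustration, the behaviour I'm after can be sketched with `pd.concat`, which outer-joins on the index — toy two-day hourly data here, not my real dataset:)

```python
import pandas as pd

# Two hourly "days": day1 is truncated, day2 is complete (toy values).
day1 = pd.Series([10, 20], index=pd.to_datetime(['00:00:00', '01:00:00']))
day2 = pd.Series([30, 40, 50],
                 index=pd.to_datetime(['00:00:00', '01:00:00', '02:00:00']))

# concat outer-joins on the index: hours missing from day1 become NaN.
combined = pd.concat([day1, day2], axis=1, keys=['day1', 'day2'])

# mean(skipna=True) ignores the NaN, so 02:00 averages over day2 alone.
combined['mean'] = combined.mean(axis=1, skipna=True)
print(combined['mean'].tolist())  # [20.0, 30.0, 50.0]
```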

Currently, I sort the list of arrays by length and add them to the data frame from longest to shortest. This works, but if even the longest array is missing timestamps, I still end up with missing data points (it also feels like a hack!).

Is there a better / easier way to accomplish this?

EDIT An example as requested. If you run the code below, you end up with a dataframe of shape (16, 2), even though the second df is 100 rows long. (If you reverse the order in which they are added to dfnew, it is 100 rows long.) This example is not resampled data, but the issue is the same: data beyond the index of the first imported array is ignored.

import pandas as pd
ls = []

t1 = ['00:00:00', '00:01:00', '00:02:00', '00:03:00', '00:04:00', '00:05:00', '00:06:00', '00:07:00', '00:08:00', '00:09:00', '00:10:00', '00:11:00', '00:12:00', '00:13:00', '00:14:00', '00:15:00']

val1 = [52, 52, 54, 54, 54, 58, 53, 52, 59, 61, 58, 60, 61, 58, 52, 54]

t2  = ['00:00:00', '00:01:00', '00:02:00', '00:03:00', '00:04:00', '00:05:00', '00:06:00', '00:07:00', '00:08:00', '00:09:00', '00:10:00', '00:11:00', '00:12:00', '00:13:00', '00:14:00', '00:15:00', '00:16:00', '00:17:00', '00:18:00', '00:19:00', '00:20:00', '00:21:00', '00:22:00', '00:23:00', '00:24:00', '00:25:00', '00:26:00', '00:27:00', '00:28:00', '00:29:00', '00:30:00', '00:31:00', '00:32:00', '00:33:00', '00:34:00', '00:35:00', '00:36:00', '00:37:00', '00:38:00', '00:39:00', '00:40:00', '00:41:00', '00:42:00', '00:43:00', '00:44:00', '00:45:00', '00:46:00', '00:47:00', '00:48:00', '00:49:00', '00:50:00', '00:51:00', '00:52:00', '00:53:00', '00:54:00', '00:55:00', '00:56:00', '00:57:00', '00:58:00', '00:59:00', '01:00:00', '01:01:00', '01:02:00', '01:03:00', '01:04:00', '01:05:00', '01:06:00', '01:07:00', '01:08:00', '01:09:00', '01:10:00', '01:11:00', '01:12:00', '01:13:00', '01:14:00', '01:15:00', '01:16:00', '01:17:00', '01:18:00', '01:19:00', '01:20:00', '01:21:00', '01:22:00', '01:23:00', '01:24:00', '01:25:00', '01:26:00', '01:27:00', '01:28:00', '01:29:00', '01:30:00', '01:31:00', '01:32:00', '01:33:00', '01:34:00', '01:35:00', '01:36:00', '01:37:00', '01:38:00', '01:39:00']

val2 = [52, 52, 54, 54, 54, 58, 53, 52, 59, 61, 58, 60, 61, 58, 52, 54, 53, 54, 61, 53, 56, 56, 58, 56, 69, 69, 60, 53, 55, 53, 53, 53, 54, 56, 53, 53, 53, 52, 53, 53, 53, 54, 54, 54, 53, 53, 53, 54, 54, 54, 54, 54, 55, 55, 54, 55, 53, 54, 54, 55, 54, 53, 53, 53, 54, 54, 53, 53, 54, 54, 54, 55, 56, 53, 55, 52, 52, 51, 48, 49, 48, 48, 49, 49, 49, 50, 49, 48, 48, 48, 48, 49, 49, 48, 48, 49, 50, 49, 50, 49]

df1 = pd.DataFrame({'values': val1, 'time': t1})
df1['time'] = pd.to_datetime(df1['time'])
df1 = df1.set_index('time')
ls.append(df1)

df2 = pd.DataFrame({'values': val2, 'time': t2})
df2['time'] = pd.to_datetime(df2['time'])
df2 = df2.set_index('time')
ls.append(df2)

dfnew = pd.DataFrame()
dfnew[['vals1']] = ls[0]
dfnew[['vals2']] = ls[1]
print(dfnew.shape)
    Could you add some example data that illustrates the problem (i.e., a list of different length series)? That would make it easier for testing, see: https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Shaido Oct 19 '21 at 06:27
    Hi, have added some minimally viable code to highlight the issue. Thanks – its_broke_again Oct 19 '21 at 07:21
    You can try to use `pd.concat(ls, axis=1)`, I think it should give you the wanted result independent of the order of the series. – Shaido Oct 19 '21 at 07:24
    Ahhh!! Thank you. The data I work with means I only ever add to a df using `df[['name']] = a_list` notation. Of course, concat is the answer. – its_broke_again Oct 19 '21 at 09:09
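(For anyone landing here later: a minimal sketch of the `pd.concat` suggestion from the comments, using shortened versions of the two frames from the example above:)

```python
import pandas as pd

# Two frames of unequal length, as in the example above (shortened).
t1 = pd.to_datetime(['00:00:00', '00:01:00', '00:02:00'])
t2 = pd.to_datetime(['00:00:00', '00:01:00', '00:02:00', '00:03:00', '00:04:00'])
df1 = pd.DataFrame({'values': [52, 52, 54]}, index=t1)
df2 = pd.DataFrame({'values': [52, 52, 54, 54, 54]}, index=t2)

# A single outer concat keeps the union of both indexes, so the shorter
# frame is padded with NaN instead of truncating the longer one.
dfnew = pd.concat([df1, df2], axis=1)
dfnew.columns = ['vals1', 'vals2']
print(dfnew.shape)  # (5, 2) — not (3, 2)
```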
