
The aim is to convert a DataFrame whose data column holds a list per row (and thus has just one timestamp and one duration per row) into a long-format time series with a DatetimeIndex entry for each single item.

In the result, there is no sequence/list per row anymore, just a single value column.

import pandas as pd

df_test = pd.DataFrame({'timestamp': [1462352000000000000, 1462352100000000000, 1462352200000000000, 1462352300000000000],
                        'nestedList': [[1,2,1,9], [2,2,3,0], [1,3,3,0], [1,1,3,9]],
                        'duration_sec': [3.0, 3.0, 3.0, 3.0]})

# The nanosecond epoch timestamps convert directly to a DatetimeIndex
tdi = pd.DatetimeIndex(df_test.timestamp)
df_test.set_index(tdi, inplace=True)
df_test.drop(columns='timestamp', inplace=True)
df_test.index.name = 'datetimeindex'

Out:

                       nestedList  duration_sec
datetimeindex                                  
2016-05-04 08:53:20  [1, 2, 1, 9]           3.0
2016-05-04 08:55:00  [2, 2, 3, 0]           3.0
2016-05-04 08:56:40  [1, 3, 3, 0]           3.0
2016-05-04 08:58:20  [1, 1, 3, 9]           3.0

The aim is:

                   value
datetimeindex
2016-05-04 08:53:20  1
2016-05-04 08:53:21  2
2016-05-04 08:53:22  1
2016-05-04 08:53:23  9
2016-05-04 08:55:00  2
2016-05-04 08:55:01  2
2016-05-04 08:55:02  3
2016-05-04 08:55:03  0
2016-05-04 08:56:40  1
2016-05-04 08:56:41  3
2016-05-04 08:56:42  3
2016-05-04 08:56:43  0
2016-05-04 08:58:20  1
2016-05-04 08:58:21  1
2016-05-04 08:58:22  3
2016-05-04 08:58:23  9

Mind that this does not mean simply adding 1 second per item; that value was only chosen to simplify the example. Rather, each row holds a sequence of 4 items with a given total duration of, for example, 3.0 seconds (which may also vary from row to row), and the first item of each sequence always starts at "time 0". The seconds per item are therefore calculated as

3.0 sec / (4 - 1) steps = 1 sec per step.
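As a minimal sketch of that calculation (the variable names are illustrative, not from the original post):

import pandas as pd

df = pd.DataFrame({'nestedList': [[1, 2, 1, 9]], 'duration_sec': [3.0]})

# Steps between items: list length minus 1, since the first item
# starts at "time 0" and consumes no duration itself.
steps = df['nestedList'].str.len() - 1
step_sec = df['duration_sec'] / steps
print(step_sec)  # 0    1.0  ->  3.0 sec / 3 steps = 1 sec per step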

Context:

The example shows the conversion to a DatetimeIndex since this makes the data suitable for seasonal_decompose(); see, for example, the first search hit for that function.

There, the resulting df looks like this:

df_test2 = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv', parse_dates=['date'], index_col='date')

Out:

                value
date                 
1991-07-01   3.526591
1991-08-01   3.180891
1991-09-01   3.252221
1991-10-01   3.611003
1991-11-01   3.565869
              ...
2008-02-01  21.654285
2008-03-01  18.264945
2008-04-01  23.107677
2008-05-01  22.912510
2008-06-01  19.431740

[204 rows x 1 columns]

Then it is easy to apply seasonal_decompose() with an additive decomposition model:

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

result_add = seasonal_decompose(df_test2['value'], model='additive', extrapolate_trend='freq')

# Plot
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()


Now the same is needed for the df_test above.

  • Does the duration_sec column have any significance in this question? – EddyG Aug 04 '20 at 12:33
  • @EddyG It is needed because you somehow need to find out which timestamp is the next after the "start" which you only know once for each row. I do not have any other information about the time between the items in each sequence. Thus the next timestamp can only be calculated by "+duration / len(sequence)" I think. – questionto42 Aug 04 '20 at 12:38

1 Answer


Use DataFrame.explode first, then build a per-row counter with GroupBy.cumcount, convert it with to_timedelta, and add it to the index:

df_test = df_test.explode('nestedList')
# After explode, the duplicate index labels still group the items of each
# original row, so cumcount() numbers them 0..3 within each sequence.
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount(), unit='s')

print (df_test)
                    nestedList  duration_sec
2016-05-04 08:53:20          1           3.0
2016-05-04 08:53:21          2           3.0
2016-05-04 08:53:22          1           3.0
2016-05-04 08:53:23          9           3.0
2016-05-04 08:55:00          2           3.0
2016-05-04 08:55:01          2           3.0
2016-05-04 08:55:02          3           3.0
2016-05-04 08:55:03          0           3.0
2016-05-04 08:56:40          1           3.0
2016-05-04 08:56:41          3           3.0
2016-05-04 08:56:42          3           3.0
2016-05-04 08:56:43          0           3.0
2016-05-04 08:58:20          1           3.0
2016-05-04 08:58:21          1           3.0
2016-05-04 08:58:22          3           3.0
2016-05-04 08:58:23          9           3.0

EDIT:

To honour the duration_sec column instead of a fixed 1-second step, derive the per-item duration from the sequence length first:

df_test = df_test.explode('nestedList')
# Steps between items: sequence length minus 1, because the first item
# of each sequence starts at "time 0" and consumes no duration itself.
sizes = df_test.groupby(level=0)['nestedList'].transform('size').sub(1)
# Per-item duration, e.g. 3.0 sec / 3 steps = 1.0 sec
duration = df_test['duration_sec'].div(sizes)
df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount() * duration, unit='s')
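As a quick plausibility check of this variable-duration version (the sample row below is made up for illustration, not taken from the original data), a 4-item sequence with duration_sec = 5.0 yields steps of 5/3 s, matching the reasoning in the comments below:

import pandas as pd

# Hypothetical sample: one 4-item sequence lasting 5.0 seconds in total
df = pd.DataFrame({'nestedList': [[1, 2, 1, 9]], 'duration_sec': [5.0]},
                  index=pd.DatetimeIndex(['2016-05-04 08:53:20']))

df = df.explode('nestedList')
sizes = df.groupby(level=0)['nestedList'].transform('size').sub(1)
duration = df['duration_sec'].div(sizes)  # 5.0 sec / 3 steps per item
df.index += pd.to_timedelta(df.groupby(level=0).cumcount() * duration, unit='s')
print(df)
# Timestamps advance by 5/3 s:
# 08:53:20.000, 08:53:21.666..., 08:53:23.333..., 08:53:25.000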

EDIT2 by asker:

With the resulting df, this simple application of seasonal_decompose() is now possible, which was the final aim:

from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# explode() keeps object dtype, so cast the values to a numeric dtype first
result_add = seasonal_decompose(x=df_test['nestedList'].astype(float), model='additive', extrapolate_trend='freq', period=int(len(df_test)/2))
plt.rcParams.update({'figure.figsize': (5,5)})
result_add.plot().suptitle('Additive Decompose', fontsize=22)
plt.show()

[Plot: simple application of seasonal_decompose, pasted by asker]

  • As to @EddyG's comment, this is only counting seconds up at the moment. If I have 50 items in the sequence, and the duration of 30 seconds, this would not work anymore. I would have to add 30/50s instead of a full second for each count. I have accepted this too early now, my question was not clear about this. Though the answer now is already brilliant regarding the few minutes that the question was posted. – questionto42 Aug 04 '20 at 12:50
  • 1
    @Lorenz - One thing is not clear for me, there is 4 values in sublists, but there is `duration_sec` column divided by `3` ? So it means by default is subtract length by `1` ? So then solution is `df_test = df_test.explode('nestedList')` `sizes = df_test.groupby(level=0)['nestedList'].transform('size').sub(1).div(df_test['duration_sec'])` `df_test.index += pd.to_timedelta(df_test.groupby(level=0).cumcount() * sizes, unit='s')` – jezrael Aug 04 '20 at 12:53
  • The duration is 3.0 sec for all 4 items (but just an example that I chose), I thought that this would be it if the first item would start at once, followed by the rest in distances of 1 sec. I guess this is confusing, I should change it. I have just taken that to get a simple example. In reality, the time per item is much lower. Let us assume each item would take 0.05 seconds, could you help me out how that would be grouped by? I guess that is what you offer here with `* sizes`, let me check. Brilliant, it works. Could you add this in the answer, and I change the question to "sizes" variable? – questionto42 Aug 04 '20 at 13:01
  • @Lorenz - Is it possible to change the data sample to another duration, with the expected output? – jezrael Aug 04 '20 at 13:03
  • 1
    I will change the question so that duration will not count a role anymore, but a generally set duration per item, in whatever sequence it may be. Give me a few minutes. – questionto42 Aug 04 '20 at 13:05
  • 1
    OK, that is done, now the global duration variable is included in the question. Please just change your idea using `groupby(level=0).cumcount() * sizes` to `groupby(level=0).cumcount() * duration` in your answer, introduce `duration = 1` right before, and delete the now unneeded duration_sec column of the output. Thx – questionto42 Aug 04 '20 at 13:16
  • 1
    I will check. I will let you know if this solves it, it might help deriving the duration per item what you seem to be offering now. Seems promising, give me some time. – questionto42 Aug 04 '20 at 13:24
  • You must have it the other way round, `df_test['duration_sec'].div(sizes)` would be the duration per item. The rest should be OK, the check is plausible. You rightly ignore one item on purpose so that you get for example: "sequence duration of 5 sec" & "sequence of 4 items" --> "item duration of 5/3 sec". I will have to check this in the bigger dataset, but it should be right like this. I will roll back the question to have the duration_sec column again as a requirement. – questionto42 Aug 04 '20 at 13:38