How to format a data set for time series prediction in H2O's Driverless AI

Question

For simplicity, say that I am attempting to predict the following day's value of a sequence of single-valued variables, so my dataset would be in the form of:

input    label
   x1       x2
   x2       x3
   x3       x4
  ...      ...
   xt      xt+1

However, my data has the same sequences in time for many different users, therefore it is in the following form:

input    label
 u1x1     u1x2
 u1x2     u1x3
 u1x3     u1x4
  ...      ...
 u1xt   u1xt+1
 u2x1     u2x2
 u2x2     u2x3
 u2x3     u2x4
  ...      ...
 u2xt   u2xt+1
  ...      ...
 unx1     unx2
 unx2     unx3
 unx3     unx4
  ...      ...
 unxt   unxt+1

What is an acceptable way to structure this data and feed it into DAI such that it is not treated as one entire long sequence, but rather a bunch of not directly related sequences parallel in time?

Edit: The data has a 'UserID' column. Can DAI automatically use this to overcome the problem I am explaining?

Lauren · Answer 1 · 2018-08-20T18:57:40.440

To format your data for forecasting, you need to aggregate your data for each group of interest and for a specific time period (in your case one day).

So if your forecast horizon is one day, you need to aggregate your single-valued variable by user and by day, so that the target (label) is a total amount per user per day. You can find documentation on how to set up your data for Driverless AI here and here.
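As a rough illustration of that aggregation step, here is a minimal sketch in plain Python. The column names (`UserID`, `Date`, `Target`), the sample values, and the choice of a sum as the aggregate are all assumptions made for the example; they are not taken from the question or from the Driverless AI documentation.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events: (user id, timestamp, value). All values are made up.
raw_events = [
    ("u1", "2018-08-01 09:15:00", 2.0),
    ("u1", "2018-08-01 17:40:00", 3.0),
    ("u1", "2018-08-02 11:05:00", 4.0),
    ("u2", "2018-08-01 08:00:00", 1.5),
    ("u2", "2018-08-02 13:30:00", 2.5),
    ("u2", "2018-08-02 21:10:00", 1.0),
]


def aggregate_daily(events):
    """Sum values per (user, calendar day) and return sorted long-format rows."""
    totals = defaultdict(float)
    for user_id, timestamp, value in events:
        day = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
        totals[(user_id, day)] += value
    # One output row per user per day, ordered by user and then by date.
    return [
        {"UserID": user_id, "Date": day.isoformat(), "Target": total}
        for (user_id, day), total in sorted(totals.items())
    ]


daily_rows = aggregate_daily(raw_events)
for row in daily_rows:
    print(row)
```

If the data already has exactly one value per user per day, as in the question's example, this step leaves it unchanged.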

EDIT in response to comment:

Here is another example to explain the expected data format using the assumption that each user should be aggregated at the day level:

If you have one day’s worth of data for 5 users, your dataset should only have 5 rows, but if you have 10 days’ worth of data for 5 users, you should have 50 rows of data.
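The row-count arithmetic above (users times days) can be checked with a small sketch like the following. The `UserID` and `Date` field names and the generated sample data are hypothetical, and whether Driverless AI tolerates missing user/day combinations is not addressed in this answer, so treat the check as a sanity test rather than a requirement.

```python
from itertools import product


def missing_user_days(rows):
    """Return the (user, date) pairs that have no row in the data set."""
    users = sorted({row["UserID"] for row in rows})
    days = sorted({row["Date"] for row in rows})
    present = {(row["UserID"], row["Date"]) for row in rows}
    return [pair for pair in product(users, days) if pair not in present]


# Hypothetical data: 5 users with 10 days each, one row per user per day.
rows = [
    {"UserID": f"u{u}", "Date": f"2018-08-{d:02d}", "Target": float(u * d)}
    for u in range(1, 6)
    for d in range(1, 11)
]

n_users = len({row["UserID"] for row in rows})
n_days = len({row["Date"] for row in rows})
print(len(rows), n_users * n_days)  # both counts should match for complete data
print(missing_user_days(rows))      # empty list when no user/day pair is missing
```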

Then, when you set up your experiment in Driverless AI, you would set your Time Group to the User column.

edited Aug 20 '18 at 18:57

answered Aug 17 '18 at 21:54

Hi Lauren. Thanks for the explanation and the links. So from what I gather, the sample data I provided is already formatted correctly to be fed into DAI? It has the entire time-series sequence of length t of data for user 1 in the first t rows, followed by the t rows of data for user 2, etc. – KOB Aug 20 '18 at 13:11
@KOB I updated the answer with a new example, which should help clarify why the sample data you provided is not yet formatted correctly for DAI. – Lauren Aug 20 '18 at 18:58
I still do not understand how this is different to my original example. In my example, I have t days of data for n users, hence the data set has t × n (t times n) rows. Therefore, if I had t=10 days of data and n=5 users, my data set would have 50 rows, just like you have suggested. – KOB Aug 20 '18 at 19:05
I see, I misunderstood your t units. Yes, your sample is in the correct format then. – Lauren Aug 20 '18 at 20:10
