For simplicity, say that I am attempting to predict the next day's value in a sequence of single-valued variables, so my dataset would be in the form:

input    label
   x1       x2
   x2       x3
   x3       x4
  ...      ...
   xt      xt+1

However, my data contains the same kind of time sequence for many different users, so it is actually in the following form:

input    label
 u1x1     u1x2
 u1x2     u1x3
 u1x3     u1x4
  ...      ...
 u1xt   u1xt+1
 u2x1     u2x2
 u2x2     u2x3
 u2x3     u2x4
  ...      ...
 u2xt   u2xt+1
  ...      ...
 unx1     unx2
 unx2     unx3
 unx3     unx4
  ...      ...
 unxt   unxt+1

What is an acceptable way to structure this data and feed it into DAI such that it is not treated as one entire long sequence, but rather a bunch of not directly related sequences parallel in time?

Edit: The data has a 'UserID' column. Can DAI automatically use this to overcome the problem I am explaining?

KOB

1 Answer

To format your data for forecasting, you need to aggregate your data for each group of interest and for a specific time period (in your case one day).

So if your forecast horizon is one day, you need to aggregate by user and by day, so that the target (label) is your single-valued variable totaled per user per day. You can find documentation on how to set up your data for Driverless AI here and here.
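As a rough sketch of that aggregation step (plain Python, not Driverless AI code), the snippet below sums raw records into one row per user per day. The record layout and the sample values are made up for illustration; only the 'UserID' column name comes from the question.

```python
from collections import defaultdict

# Hypothetical raw records: (UserID, day, value).
# Several records may share the same user and day.
raw = [
    ("u1", "2018-08-01", 2.0),
    ("u1", "2018-08-01", 3.0),
    ("u1", "2018-08-02", 4.0),
    ("u2", "2018-08-01", 1.0),
    ("u2", "2018-08-02", 1.5),
    ("u2", "2018-08-02", 0.5),
]

def aggregate_daily(records):
    """Sum values so that each (user, day) pair appears exactly once."""
    totals = defaultdict(float)
    for user, day, value in records:
        totals[(user, day)] += value
    # Sort by user, then day, so each user's series is contiguous and in time order.
    return [(user, day, total) for (user, day), total in sorted(totals.items())]

daily = aggregate_daily(raw)
for row in daily:
    print(row)
```

If the data already has exactly one value per user per day, as in the question, this step leaves it unchanged.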

EDIT in response to comment:

Here is another example to explain the expected data format using the assumption that each user should be aggregated at the day level:

If you have one day's worth of data for 5 users, your dataset should have only 5 rows, but if you have 10 days' worth of data for 5 users, you should have 50 rows of data.

Then in Driverless AI, when you set up your experiment, you would set your Time Group to the User column.
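To make the expected shape concrete, here is a small sketch in plain Python (it does not call Driverless AI and says nothing about what the product does internally). It builds a made-up 5 users x 10 days table, checks the row count, and forms next-day input/label pairs separately per user, so no pair ever crosses from one user's sequence into another's. The function name and the values are hypothetical.

```python
from itertools import groupby

# Hypothetical long-format rows: (UserID, day index, value), 5 users x 10 days.
users = [f"u{i}" for i in range(1, 6)]
days = range(1, 11)
rows = [(u, d, float(d)) for u in users for d in days]

# One row per user per day: 5 * 10 = 50 rows, with no duplicate (user, day) pairs.
assert len(rows) == len(users) * len(days)
assert len({(u, d) for u, d, _ in rows}) == len(rows)

def next_day_pairs(rows):
    """Pair each value with the same user's next-day value."""
    pairs = []
    ordered = sorted(rows, key=lambda r: (r[0], r[1]))
    for user, series in groupby(ordered, key=lambda r: r[0]):
        series = list(series)
        # zip stops at the user's last day, so pairs never span two users.
        for (_, _, x), (_, _, y) in zip(series, series[1:]):
            pairs.append((user, x, y))
    return pairs

pairs = next_day_pairs(rows)
print(len(rows), len(pairs))
```

Each user contributes one fewer pair than they have days, because the last day has no next-day label.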

Lauren
  • Hi Lauren. Thanks for the explanation and the links. So from what I gather, the sample data I provided is already formatted correctly to be fed into DAI? It has the entire time-series sequence of length t of data for user 1 in the first t rows, followed by the t rows of data for user 2, etc. – KOB Aug 20 '18 at 13:11
  • @KOB I updated the answer. The new example should help clarify why the sample data you provided is not yet formatted correctly for DAI. – Lauren Aug 20 '18 at 18:58
  • I still do not understand how this is different to my original example. In my example, I have t days of data for n users, hence the data set has txn (t times n) rows. Therefore, if I had t=10 days of data, and n=5 users, my data set would have 50 rows, just like you have suggested. – KOB Aug 20 '18 at 19:05
  • I see, I misunderstood your t units. Yes, your sample is in the correct format then. – Lauren Aug 20 '18 at 20:10