AWS Machine Learning Data

Question

I'm using AWS Machine Learning regression to predict the waiting time in a restaurant's line for a specific weekday and time. Today I have around 800k records.

Example Data:

restaurantID (rowID)   weekDay (categorical)   time (categorical)   tablePeople (numeric)   waitingTime (numeric - target)
1                      sun                     21:29                2                       23
2                      fri                     20:13                4                       43
...

I have two questions:

1) Should I use time as Categorical or Numeric? Is it better to split it into two fields: hours and minutes?

2) I would like to get predictions for all my restaurants from the same model.

Example: I expected to send the rowID identifier and get back different predictions, based on each restaurant's data (ignoring the other restaurants' data).

I tried, but it returns the same prediction for any rowID. Why?

Should I have a model for each restaurant?

Vlad · Accepted Answer · 2017-02-04T02:49:29.187

There are several problems with the way you set up your model:

1) Time in the form you have it should never be categorical. Your model treats the times 12:29 and 12:30 as two completely independent values, so it will never use what it learns about 12:29 to predict what is going to happen at 12:30. In your case you should set time to be numeric. I'm not sure whether Amazon ML can convert it for you automatically; if not, just multiply the hour by 60 and add the minutes. Another interesting thing to do is to bucketize your time into half-hour or wider intervals. You do that by dividing (h*60+m) by the interval width in minutes and keeping the integer part, so try 120 to get 2-hour intervals. Generally, the more data you have, the smaller the intervals can be. The key is to have a lot of samples in each bucket.
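The conversion and bucketing described above can be sketched in a few lines of Python. This is only an illustration done outside Amazon ML, as a preprocessing step; the function names `minutes_since_midnight` and `time_bucket` are made up for this example, and it assumes the time always arrives as an `HH:MM` string like in the question's sample rows.

```python
def minutes_since_midnight(hhmm: str) -> int:
    """Convert an 'HH:MM' string to minutes since midnight (h*60+m)."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def time_bucket(hhmm: str, bucket_minutes: int = 120) -> int:
    """Return the index of the fixed-width bucket that contains the time."""
    # Integer division keeps only the whole number of buckets elapsed.
    return minutes_since_midnight(hhmm) // bucket_minutes


print(minutes_since_midnight("21:29"))  # 1289
print(time_bucket("21:29"))             # 10 (the 20:00-21:59 bucket)
print(time_bucket("20:13", 30))         # 40 (the 20:00-20:29 bucket)
```

Either the numeric minutes or the bucket index could then replace the raw `time` column; which one works better depends on how many samples end up in each bucket.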

2) You should really think about removing restaurantID from your input data. Having it there will cause the model to over-fit on it, so it will not be able to make predictions about the restaurant with id:5 based on what it learned from the restaurants with id:3 or id:9. Keeping the restaurant id might be okay if you have a lot of data about each restaurant and you don't care about extrapolating your predictions to restaurants that are not in the training set.

3) You never send restaurantID to predict data about it. The way it usually works is that you pick what you are trying to predict. In your case 'waitingTime' is probably the most useful attribute. So you send weekDay, time and number of people, and the model outputs the waiting time.

Thank you, that helped me a lot! Two last questions: Should I have an ML model for each restaurant? Is it impossible to use the same one? — Luciano Nascimento, Feb 04 '17 at 03:45
You don't need to have a model for each restaurant. If you want your model to account for particular restaurants, just include the id as an attribute. But in that case you will need a lot of data for each restaurant, and your model will not use data from one restaurant to predict the wait time in another. Maybe you should look at what is common between different restaurants and create more attributes from that (class of restaurant, type of food, whether it is near a theater or stadium, etc.) — Vlad, Feb 04 '17 at 03:52
@Vlad - Those are excellent recommendations. I would also suggest adding a `Weekday/weekend` field, on the assumption that Monday-Friday are probably similar, so the model could make predictions based on that information rather than treating every single day as independent. Try to think about other things that might introduce variation throughout the year, such as `season`, `month`, or even `temperature`. Basically, if I were to walk up to you and say "how long do I have to wait on day X", think about what you'd ask me -- "Is it a weekend?", "Is it a public holiday?", "Was it warm that day?" — John Rotenstein, Feb 04 '17 at 21:11
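As a rough illustration of the `Weekday/weekend` idea from the comment above, a derived flag could be computed before uploading the data. This is a sketch, not part of Amazon ML: the helper name `is_weekend` is hypothetical, and it assumes weekDay uses three-letter abbreviations such as `sun` and `fri`, as in the question's sample rows, with Saturday and Sunday counted as the weekend.

```python
# Assumption: weekDay values are three-letter abbreviations ("mon" ... "sun").
WEEKEND_DAYS = {"sat", "sun"}


def is_weekend(week_day: str) -> int:
    """Return 1 for Saturday/Sunday and 0 for any other day."""
    return 1 if week_day.strip().lower() in WEEKEND_DAYS else 0


print(is_weekend("sun"))  # 1
print(is_weekend("fri"))  # 0
```

Whether Friday evening belongs with the weekend is a domain decision; the set above is only a starting point.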

score 1 · Answer 2 · answered Feb 05 '17 at 22:31

You should think about what is relevant for the prediction to be accurate, and use your domain expertise to define the features/attributes you need to have in your data.

For example, time of day is not just a number. From my limited understanding of restaurants, I would drop the minutes and focus only on the hours.

I would certainly create a model for each restaurant, as the popularity of the restaurant and the type of food it serves have an impact on the wait time. With Amazon ML it is easy to create many models, as you can build the models using the SDK and even schedule retraining of the models using AWS Lambda (that is, automatically).

I'm not sure what the feature called tablePeople means, but a general recommendation is to have as many relevant features as possible to get better predictions. For example, month or season is probably important as well.

score 0 · Answer 3 · answered Aug 07 '18 at 01:56

In contrast with some answers to this post, I think restaurantID helps and actually gives valuable information. If you have a significant amount of data for each restaurant, then you can train a model per restaurant and get good accuracy, but if you don't have enough data then restaurantID is very informative.

1) Just imagine that you had only two columns in your dataset: restaurantID and waitingTime. Wouldn't the restaurantID in the testing data help you find a rough waiting time? In the simplest implementation, the predicted waiting time for each restaurantID would be the average of its waitingTime values. So restaurantID is definitely valuable information. Now that you have more features in your dataset, you need to check whether restaurantID is as effective as the other features or not.
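The "simplest implementation" mentioned above, a per-restaurant average, could look roughly like this. It is a minimal sketch in plain Python; the function name `mean_wait_by_restaurant` and the sample rows are made up for illustration and do not come from the question's data.

```python
from collections import defaultdict


def mean_wait_by_restaurant(rows):
    """Average waitingTime per restaurantID.

    rows: iterable of (restaurant_id, waiting_time) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # id -> [sum of waits, count]
    for restaurant_id, waiting_time in rows:
        totals[restaurant_id][0] += waiting_time
        totals[restaurant_id][1] += 1
    return {rid: total / count for rid, (total, count) in totals.items()}


# Made-up rows, only to show the shape of the input and output.
sample = [("A", 23), ("A", 43), ("B", 10)]
print(mean_wait_by_restaurant(sample))  # {'A': 33.0, 'B': 10.0}
```

A baseline like this also gives a number to compare against: if the trained model does not beat the per-restaurant average, the extra features are not yet helping.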

2) If you decide to keep restaurantID, then you must use it as a categorical string. It should be a categorical (non-numeric) feature in your dataset, and maybe that's why you did not get a proper result.

On the issue of day and time, I agree with the other answers. Considering that you are building your model for restaurants, hourly time may give a more accurate result.
