I want to train a different model for each user in my dataset. Is there built-in support for that in Spark MLlib/Pipelines?

If not, what's the easiest/cleanest way to train a separate model for each user?

Marsellus Wallace

1 Answer

Unfortunately, Spark ML has no built-in support for the "single model - single user" concept, but you can implement custom logic for it. I see two possible ways of solving this task. The first is to follow an algorithm like the one below (the details are only an example; your steps will differ, but the logic will be similar):

  • Obtain the training data for the specific user (e.g. read a CSV file from HDFS, S3, etc.).
  • Train a model on the dataset built from that user's data. For example, suppose your dataset has two columns, some criterion X and the user's productivity Y, and the latter varies between user groups. You could then train, for instance, a LinearRegression model to predict whether the user can finish the work in time or not.
  • Save the trained model to disk under a name that depends on the user's id, group, etc.
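
As a rough sketch, the three steps above could look like the following. This is plain Python rather than Spark code, so it runs anywhere: `fit_line`, `train_per_user`, the toy rows, and the one-JSON-file-per-user layout are all made up for illustration. With Spark you would presumably swap the hand-written fit for something like `LinearRegression` and use the model's own `save` method instead of JSON.

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = cov_xy / var_x  # assumes each user has at least two distinct X values
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def train_per_user(rows, model_dir):
    """Group rows by user, fit one model per user, save each under the user's id."""
    # Step 1: collect the training data that belongs to each user.
    by_user = defaultdict(list)
    for user_id, x, y in rows:
        by_user[user_id].append((x, y))

    models = {}
    for user_id, points in by_user.items():
        # Step 2: train a model on this user's data only.
        model = fit_line(points)
        # Step 3: save it to disk under a name derived from the user id.
        (Path(model_dir) / f"{user_id}.json").write_text(json.dumps(model))
        models[user_id] = model
    return models


# Hypothetical toy data: (user_id, X, Y)
rows = [
    ("alice", 1.0, 3.0), ("alice", 2.0, 5.0), ("alice", 3.0, 7.0),
    ("bob", 1.0, 1.0), ("bob", 2.0, 0.0), ("bob", 3.0, -1.0),
]
model_dir = tempfile.mkdtemp()
models = train_per_user(rows, model_dir)
```

To predict for a given user later, you would load the file named after that user's id and apply its slope and intercept.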

The second approach is to train your model so that it is applicable to every user: choose the algorithm's options so that it does not depend on the user group, in other words, generalize the training to all user groups. In this case there is no "single model -> single user" separation at all. If the second variant is too complicated to implement on your dataset, follow the first approach.

  • Option 1: What's the best way to parallelize model training for all users? Option 2: Could you expand on "train your model so it was applicable to every user"? – Marsellus Wallace Aug 12 '17 at 16:02
  • @Gevorg Option 1: if you just mean parallelizing the training process, you can write custom logic with ForkJoinPool or Akka; if you mean parallel training of a single model, I would recommend looking at the integration of Keras and Spark ML: https://github.com/maxpumperla/elephas#spark-ml-example Option 2: I only meant that you could build one general model for all users. It was just an assumption, because I don't know what the dataset looks like. –  Aug 12 '17 at 18:01
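
For the first case in the last comment (running the independent per-user trainings in parallel), a minimal sketch of the idea might look like this. It uses Python's `ThreadPoolExecutor` as a stand-in for the ForkJoinPool/Akka suggestion, and `train_one`, `train_all`, and the toy data are hypothetical names; the per-user fit is a deliberately trivial placeholder for whatever training call you actually use.

```python
from concurrent.futures import ThreadPoolExecutor


def train_one(item):
    """Hypothetical stand-in for a per-user fit: least-squares slope through the origin."""
    user_id, points = item
    slope = sum(x * y for x, y in points) / sum(x * x for x, _ in points)
    return user_id, slope


def train_all(data_by_user, max_workers=4):
    """Run one independent training task per user on a thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order; each task is independent.
        return dict(pool.map(train_one, data_by_user.items()))


# Hypothetical toy data: user_id -> list of (X, Y) points
data_by_user = {
    "alice": [(1.0, 2.0), (2.0, 4.0)],
    "bob": [(1.0, -1.0), (2.0, -2.0)],
}
slopes = train_all(data_by_user)
```

In pure Python, threads will not speed up CPU-bound fitting much. The pattern is more useful when each task only submits a job to the cluster from the driver and waits for it, which is the situation the comment seems to have in mind.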