
I have a dataset of 50 million user preferences containing 8 million distinct users and 180K distinct products. I am currently using a boolean data model and have a basic Tanimoto-similarity-based recommender in place. I am trying to explore different algorithms for better recommendations, and started out with SVD using the ALSWR factorizer. I have used the base SVD recommender provided in Mahout as follows:

DataModel dataModel = new FileDataModel(new File("/FilePath"));

// 50 features, lambda = 0.065, 15 iterations
ALSWRFactorizer factorizer = new ALSWRFactorizer(dataModel, 50, 0.065, 15);

Recommender recommender = new SVDRecommender(dataModel, factorizer);

As per my basic understanding, I believe the factorization takes place offline and produces the user features and item features, while actual requests are served by computing the top products for a user as the dot product of the user's vector with all possible item vectors.
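That serving step can be sketched independently of Mahout as a plain dot-product ranking. The latent vectors below are made-up toy values purely for illustration; real feature matrices would come out of the factorizer.

```java
import java.util.Arrays;

public class TopNByDotProduct {

    // Score every item for one user by dotting the user's latent vector
    // with each item's latent vector, then return the n best item indices.
    static int[] topN(double[] user, double[][] items, int n) {
        Integer[] ids = new Integer[items.length];
        double[] scores = new double[items.length];
        for (int i = 0; i < items.length; i++) {
            ids[i] = i;
            double dot = 0.0;
            for (int f = 0; f < user.length; f++) {
                dot += user[f] * items[i][f];
            }
            scores[i] = dot;
        }
        // Sort item ids by descending score and keep the top n.
        Arrays.sort(ids, (a, b) -> Double.compare(scores[b], scores[a]));
        return Arrays.stream(ids).limit(n).mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        double[] user = {0.5, 1.0};                                  // toy user vector
        double[][] items = {{1.0, 0.0}, {0.0, 1.0}, {1.0, 1.0}};     // toy item vectors
        System.out.println(Arrays.toString(topN(user, items, 2)));   // → [2, 1]
    }
}
```

This is exactly the brute-force scan the question describes: cost is O(items × features) per request, which is why it gets slow with 180K items.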

I have a couple of doubts regarding this approach:

  1. What is the best way to choose the factorization parameters, and how long does the factorization usually take? I tried the parameters above, and the factorization alone ran for 30+ minutes.
  2. Is there a way to serve real-time requests faster? Taking the dot product with all possible item vectors results in a high request time. Is there such a thing as offline SVD?
  3. Given the size of my dataset, should I be trying some other factorizer?

1 Answer


I want to answer all your questions together.

Given the size of your data and the real-time requirement, you should take another approach:

  1. Do an offline item-item similarity calculation. For items with a lot of ratings this does not need to be done very often, since their similarities mostly don't change; you may want to recalculate more frequently for items with few ratings.
  2. Calculate the user's item rating predictions per user in real time using the item-item similarity list. This operation is not that costly, since you have far fewer items than users, and it is effectively constant-time as long as the number of items doesn't change much.
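As a rough, Mahout-independent sketch of these two steps: the offline step below uses cosine similarity over a tiny made-up boolean preference matrix (a stand-in for whatever item-item similarity you actually choose, e.g. Tanimoto), and the online step scores an item for a user from the precomputed similarity table.

```java
public class ItemItemSketch {

    // Offline step: cosine similarity between the item columns of a
    // boolean user-item matrix (rows = users, columns = items).
    static double[][] itemSimilarities(boolean[][] prefs) {
        int items = prefs[0].length;
        double[][] sim = new double[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                int both = 0, ci = 0, cj = 0;
                for (boolean[] row : prefs) {
                    if (row[i]) ci++;
                    if (row[j]) cj++;
                    if (row[i] && row[j]) both++;
                }
                sim[i][j] = (ci == 0 || cj == 0) ? 0.0 : both / Math.sqrt((double) ci * cj);
            }
        }
        return sim;
    }

    // Online step: score a candidate item for a user by summing its
    // similarity to the items the user already prefers.
    static double score(boolean[] userRow, double[][] sim, int item) {
        double s = 0.0;
        for (int j = 0; j < userRow.length; j++) {
            if (userRow[j] && j != item) {
                s += sim[item][j];
            }
        }
        return s;
    }
}
```

The expensive pairwise loop lives entirely in the offline step; the online step only touches the (much smaller) item dimension, which is the point of the answer.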
  • We already have this approach working. We have a basic item-item implementation, and we are now trying a different algorithm using some form of SVD. From what I have read, I was hoping that SVD would give better recommendation quality. – user1045047 Jan 10 '14 at 14:19
  • I think you are out of luck when you have real-time requirements. By the way: you HOPE for better quality? You should first try out an algorithm and see whether it's actually better! – fatih Jan 10 '14 at 22:26
  • That is what my question is about: I have read about SVD and am now trying it out, but as stated, my first doubt is about how to tune the algorithm and get the factorization running. The second thing I asked is whether there is a way to provide recommendations in real time; maybe that should be a separate question. – user1045047 Jan 12 '14 at 19:18
  • I am very sorry, I forgot about your first question. You should do a grid search for the best parameters for your data. I don't know about the real-time capability, but you can cache the factorization. How fast is the recommendation when using the cache? – fatih Jan 12 '14 at 22:32
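The grid search suggested in the comments could be sketched like this. Note that `evaluate` here is a hypothetical placeholder for a real train-then-evaluate cycle (e.g. RMSE on a held-out split), and the grid values are made up; only the search structure is the point.

```java
public class GridSearchSketch {

    // Hypothetical stand-in for an offline evaluation metric; lower is
    // better. Replace with a real train/evaluate run over your own data.
    static double evaluate(int numFeatures, double lambda, int iterations) {
        return Math.abs(numFeatures - 20) * 0.01
                + Math.abs(lambda - 0.05)
                + Math.abs(iterations - 10) * 0.005;
    }

    // Exhaustively try every parameter combination and keep the best one,
    // returned as {numFeatures, lambda, iterations}.
    static double[] findBest() {
        int[] featureGrid = {10, 20, 50};
        double[] lambdaGrid = {0.01, 0.05, 0.1};
        int[] iterGrid = {5, 10, 15};
        double bestScore = Double.MAX_VALUE;
        double[] best = new double[3];
        for (int f : featureGrid) {
            for (double l : lambdaGrid) {
                for (int it : iterGrid) {
                    double score = evaluate(f, l, it);
                    if (score < bestScore) {
                        bestScore = score;
                        best = new double[]{f, l, it};
                    }
                }
            }
        }
        return best;
    }
}
```

Since each evaluation retrains the factorization, run the search on a sampled subset of the 50M preferences first and only confirm the winning parameters on the full data.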