I am having performance issues with precomputed item-item similarities in Mahout.
I have 4 million users and roughly the same number of items, with around 100M user-item preferences. I want to do content-based recommendation based on the cosine similarity of the TF-IDF vectors of the documents. Since computing this on the fly is slow, I precomputed the pairwise similarities of the top 50 most similar documents as follows:
- I used seq2sparse to produce the TF-IDF vectors.
- I used mahout rowId to produce a Mahout matrix.
- I used mahout rowSimilarity -i INPUT/matrix -o OUTPUT -r 4587604 --similarityClassname SIMILARITY_COSINE -m 50 -ess to produce the top 50 most similar documents.
I used Hadoop to precompute all of this. For 4 million items, the output was only 2.5GB.
Then I loaded the content of the files produced by the reducers into
Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix = ...
using the docIndex to decode the document ids. They were already integers, but rowId re-mapped them starting from 1, so I have to translate them back.
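For reference, here is a minimal sketch of that loading step (the method name, the part-file argument and the docIndexMap lookup are my own placeholders; I read the reducer output with Mahout's SequenceFileIterable, and iterateNonZero() is the pre-0.9 Vector API):

import java.util.ArrayList;
import java.util.Collection;
import java.util.Iterator;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.common.Pair;
import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Reads one part file of the rowSimilarity output (IntWritable row id ->
// VectorWritable of similarities) and turns it into ItemItemSimilarity objects.
// docIndexMap maps the 1-based row ids back to my original document ids.
static Collection<GenericItemSimilarity.ItemItemSimilarity> loadSimilarities(
    Path partFile, Map<Integer, Long> docIndexMap) {
  Collection<GenericItemSimilarity.ItemItemSimilarity> corrMatrix =
      new ArrayList<GenericItemSimilarity.ItemItemSimilarity>();
  Configuration conf = new Configuration();
  for (Pair<IntWritable, VectorWritable> record
      : new SequenceFileIterable<IntWritable, VectorWritable>(partFile, conf)) {
    long itemID1 = docIndexMap.get(record.getFirst().get());
    Vector row = record.getSecond().get();
    for (Iterator<Vector.Element> it = row.iterateNonZero(); it.hasNext();) {
      Vector.Element e = it.next();
      long itemID2 = docIndexMap.get(e.index());
      corrMatrix.add(new GenericItemSimilarity.ItemItemSimilarity(itemID1, itemID2, e.get()));
    }
  }
  return corrMatrix;
}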
For recommendation I use the following code:
ItemSimilarity similarity = new GenericItemSimilarity(corrMatrix);
CandidateItemsStrategy candidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
MostSimilarItemsCandidateItemsStrategy mostSimilarItemsCandidateItemsStrategy = new SamplingCandidateItemsStrategy(1, 1, 1, model.getNumUsers(), model.getNumItems());
Recommender recommender = new GenericItemBasedRecommender(model, similarity, candidateItemsStrategy, mostSimilarItemsCandidateItemsStrategy);
I am trying it with a limited data model (1.6M items), but I loaded all of the item-item pairwise similarities into memory; I manage to fit everything in main memory using 40GB.
When I want to produce recommendations for one user:
Recommender cachingRecommender = new CachingRecommender(recommender);
List<RecommendedItem> recommendations = cachingRecommender.recommend(userID, howMany);
The elapsed time for the recommendation process is 554.938583083 seconds, and on top of that it did not produce any recommendations. Right now I am really concerned about the performance of the recommendations. I played with the parameters of CandidateItemsStrategy
and MostSimilarItemsCandidateItemsStrategy, but I didn't get any improvement in performance.
Isn't the whole idea of precomputing everything supposed to speed up the recommendation process?
Could someone please help me and tell me where I am going wrong and what I am doing wrong?
Also, why does loading the pairwise similarities into main memory blow up so much? 2.5GB of files gets loaded into 40GB of main memory as a Collection<GenericItemSimilarity.ItemItemSimilarity>. I know that the files are serialized as IntWritable, VectorWritable key-value pairs, and that the key has to be repeated for every vector value when building the ItemItemSimilarity entries, but this seems like a little too much, don't you think?
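For context, my own rough back-of-envelope estimate (my assumption: roughly 56 bytes per in-memory entry on a 64-bit JVM, i.e. a 16-byte object header, two long ids and a double, plus an 8-byte list reference) comes out far below 40GB:

long pairs = 4000000L * 50;                // ~200M precomputed top-50 pairs
long estimatedBytes = pairs * 56;          // ~11GB of plain ItemItemSimilarity objects
System.out.println(estimatedBytes / (1L << 30) + " GB");  // prints "10 GB", nowhere near 40GB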
Thank you in advance.