
I have a pretty standard Mahout item-based recommender for news articles (using click data, so preferences are Boolean):

// In-memory cache of Boolean (click) preferences loaded from Postgres
DataModel dataModel = new ReloadFromJDBCDataModel(
        new PostgreSQLBooleanPrefJDBCDataModel(localDB, ...)
);
// Tanimoto coefficient is the usual similarity choice for Boolean preference data
ItemSimilarity itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
ItemBasedRecommender recommender = new GenericBooleanPrefItemBasedRecommender(dataModel, itemSimilarity);

I am experimenting with injecting content-based knowledge into the recommender, so that I can most highly recommend articles that are not only similar in the normal collaborative filtering sense, but also similar in the sense that they share many common terms.

The article content similarities (cosine similarity of TF-IDF vectors) are precomputed in a Mahout batch job and read from a DB. However, there will be many pairs of articles for which there is no similarity data, for two reasons:

  • The article content similarity data will be updated less often than the data model of user-item preferences, so there will be a lag before new articles have their content similarity calculated.

  • Ideally I would like to load all content similarity data into memory, so I will only store the top 20 similarities for each article.

So, for a given pair of articles, I have:

  • The item similarity (Tanimoto): 0 <= s1 <= 1
  • The content similarity (cosine): 0 <= s2 <= 1 (possibly null)

In the case where the content similarity is not null, I want to use its value to weight the item similarity, in order to give a boost to articles with similar contents.

My questions are:

  • Is it reasonable to try to combine these measures, or am I attempting something crazy?
  • What is a sensible formula to combine these 2 values into one similarity score?
  • Is this best implemented as a custom ItemSimilarity or as a Rescorer?

1 Answer


Yes, it's entirely reasonable to combine them. If both similarities are in [0,1], the most sensible combination is simply their product. This is something you inject via a custom ItemSimilarity, not an IDRescorer.
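A minimal sketch of the combination logic such a custom ItemSimilarity could apply (the Mahout interface wiring is omitted; the in-memory `contentSims` store, its key format, and the fall-back-to-Tanimoto behaviour for missing pairs are all assumptions, not part of the Mahout API):

```java
import java.util.HashMap;
import java.util.Map;

public class CombinedSimilaritySketch {

    // Hypothetical in-memory store of precomputed cosine similarities,
    // keyed by an ordered (itemID, itemID) pair; null means "no data".
    static final Map<String, Double> contentSims = new HashMap<>();

    static Double contentSimilarity(long a, long b) {
        long lo = Math.min(a, b), hi = Math.max(a, b);
        return contentSims.get(lo + ":" + hi);
    }

    // Combine the Tanimoto item similarity with the content similarity.
    // Falling back to the plain collaborative score when no content
    // similarity is stored avoids zeroing out every pair that simply
    // hasn't been through the batch job yet.
    static double combine(double tanimoto, Double cosine) {
        return (cosine == null) ? tanimoto : tanimoto * cosine;
    }

    public static void main(String[] args) {
        contentSims.put(1L + ":" + 2L, 0.5);
        System.out.println(combine(0.5, contentSimilarity(1, 2))); // product: 0.25
        System.out.println(combine(0.9, contentSimilarity(1, 3))); // no data: 0.9
    }
}
```

In a real implementation `combine` would sit inside an `itemSimilarity(long, long)` override that delegates to the wrapped TanimotoCoefficientSimilarity for the first factor.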

Sean Owen
    Thanks for the response. The reason I asked for a 'sensible formula' is that simply multiplying the similarities will result in a lower score, compared with the case where the article contents are very dissimilar and thus there is no content similarity score available. E.g. itemSimilarity=0.9, contentSimilarity=0.9 -> 0.9 x 0.9 = 0.81. itemSimilarity=0.9, contentSimilarity=null -> 0.9 x null = 0.9. I guess I can just hardcode a very low content similarity value in this case. – Chris B Jan 08 '13 at 05:36
  • Yes of course, but they will all be lower in the same sense. In an item-neighborhood-based algorithm the similarities are just weights in a weighted average. Their absolute size doesn't matter; if you halved all of them the result would be the same. – Sean Owen Jan 08 '13 at 14:56
  • @ChrisB - How did you measure content similarity? Does Mahout do that? I thought Mahout was purely for collaborative-filtering recommendations? – user1431072 Apr 01 '14 at 15:11
  • @user1431072 Yes, Mahout can measure content similarity. I precomputed the similarities in a batch. The basic workflow is: Extract articles from DB, convert them to TF-IDF vectors -> Generate a rowid matrix -> Calculate row similarities -> Store results in DB. After some experimentation I went for Cosine as my similarity measure. – Chris B Apr 02 '14 at 00:37
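The scaling point in Sean's comment (halving every similarity leaves the estimate unchanged) can be checked with a quick weighted-average sketch; the preference values and similarity weights below are made up for illustration:

```java
public class WeightScalingDemo {

    // In an item-neighborhood recommender, the estimated preference is a
    // weighted average of known preferences, weighted by item similarity.
    static double weightedAverage(double[] weights, double[] values) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < weights.length; i++) {
            num += weights[i] * values[i];
            den += weights[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        double[] prefs  = {4.0, 2.0, 5.0};  // made-up preference values
        double[] sims   = {0.8, 0.4, 0.6};  // original similarities
        double[] halved = {0.4, 0.2, 0.3};  // every similarity halved

        // Both calls print the same estimate: only relative weights matter,
        // so a uniformly lower combined score does not change the rankings.
        System.out.println(weightedAverage(sims, prefs));
        System.out.println(weightedAverage(halved, prefs));
    }
}
```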