
Sorry if it's a noob question, but I'm new to Mahout and I have to run some tests with the MovieLens datasets. What I would like to know is whether it is possible to train the recommender with u1base.csv and test it with u1test.csv, in order to determine the precision and recall?

The examples I found about evaluation only split a single dataset, but I want to use u1base to train and u1test to test.

Both u1base.csv and u1test.csv have the same format: "UserId,Item,Rating".
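A few illustrative rows, just to show the shape of the data (made-up values, not actual rows from my files):

    196,242,3.0
    186,302,3.0
    22,377,1.0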

The Java code I have:

    import java.io.File;

    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.IRStatistics;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderIRStatsEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;
    import org.apache.mahout.common.RandomUtils;

    public class RecommenderEvaluation { // class name chosen arbitrarily

        public static void main(String[] args) throws Exception {
            File userPreferencesFile = new File("u1base.csv");
            File userTeste = new File("u1test.csv");
            RandomUtils.useTestSeed();

            DataModel dataModel = new FileDataModel(userPreferencesFile);
            // Loaded but never used below -- this is the test set I want to evaluate against
            DataModel testModel = new FileDataModel(userTeste);

            RecommenderIRStatsEvaluator recommenderEvaluator = new GenericRecommenderIRStatsEvaluator();

            RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
                @Override
                public Recommender buildRecommender(DataModel dataModel) throws TasteException {
                    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
                    UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(10, userSimilarity, dataModel);
                    return new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);
                }
            };

            IRStatistics statistics = recommenderEvaluator.evaluate(
                    recommenderBuilder, null, dataModel, null, 2,
                    GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
            System.out.format("The recommender precision is %f%n", statistics.getPrecision());
            System.out.format("The recommender recall is %f%n", statistics.getRecall());
        }
    }

Any help will be much appreciated.

Vitor

1 Answer


GenericRecommenderIRStatsEvaluator doesn't (by default) support separate test and training datasets. But if we really want this, we can write our own custom evaluator. To do that, we need to understand the internals of an IRStatsEvaluator.

For every user, the evaluator first fetches that user's most relevant items, i.e. the top at (say 10) items. Then it builds and runs the recommender for this user and gets the top recommendations.

A = set of most relevant items = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}

B = set of recommended items = {1, 2, 11, 12, 13}

Now precision is the proportion of recommended items that are relevant (how many of the recommended items are relevant), i.e. precision = count(A intersection B) / count(B) = 2 out of 5 = 0.4.

Recall is the proportion of relevant items that end up in the recommendations, i.e. recall = count(A intersection B) / count(A) = 2 out of 10 = 0.2.
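Just to make the arithmetic concrete, here is a minimal, self-contained sketch of that calculation using plain Java sets (Mahout internally uses FastIDSet, but the math is identical):

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class PrecisionRecallSketch {
        public static void main(String[] args) {
            // A: the user's most relevant items (from the example above)
            Set<Long> relevant = new HashSet<>(
                    Arrays.asList(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L));
            // B: the items the recommender actually returned
            Set<Long> recommended = new HashSet<>(
                    Arrays.asList(1L, 2L, 11L, 12L, 13L));

            // A intersection B
            Set<Long> hits = new HashSet<>(recommended);
            hits.retainAll(relevant);

            double precision = (double) hits.size() / recommended.size(); // 2/5 = 0.4
            double recall = (double) hits.size() / relevant.size();       // 2/10 = 0.2
            System.out.format("precision=%.1f recall=%.1f%n", precision, recall);
        }
    }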

So the logic here is getting two sets of items (most relevant & recommended). The default implementation of IRStatsEvaluator finds both of these sets based on a single data model, and we need to customize it in the following manner:

  1. Relevant items should be calculated based on the test dataset.
  2. Recommended items should be calculated based on the train dataset.

Below is the place where the relevant items are calculated. So instead of the (train) data model, pass the test data model to dataSplitter.getRelevantItemsIDs().

// GenericRecommenderIRStatsEvaluator
public IRStatistics evaluate(RecommenderBuilder recommenderBuilder,
                             DataModelBuilder dataModelBuilder,
                             DataModel dataModel,
                             IDRescorer rescorer,
                             int at,
                             double relevanceThreshold,
                             double evaluationPercentage) throws TasteException {
    // ...
    FastIDSet relevantItemIDs = dataSplitter.getRelevantItemsIDs(userID, at, theRelevanceThreshold, dataModel);
    // ...
}

// CustomizedRecommenderIRStatsEvaluator
public IRStatistics evaluate(RecommenderBuilder recommenderBuilder,
                             DataModelBuilder dataModelBuilder,
                             DataModel trainDataModel,
                             DataModel testDataModel,
                             IDRescorer rescorer,
                             int at,
                             double relevanceThreshold,
                             double evaluationPercentage) throws TasteException {
    // ...
    FastIDSet relevantItemIDs = dataSplitter.getRelevantItemsIDs(userID, at, theRelevanceThreshold, testDataModel);
    // ...
}

Apart from these changes, keep everything else as it is. It will work!
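For instance, the calling code from your question could then look roughly like this (a sketch: CustomizedRecommenderIRStatsEvaluator is the class we just outlined, recommenderBuilder is the same builder you already have, and at / threshold / percentage values are taken from your snippet):

    DataModel trainModel = new FileDataModel(new File("u1base.csv"));
    DataModel testModel = new FileDataModel(new File("u1test.csv"));

    // The customized evaluator takes both models instead of splitting one
    CustomizedRecommenderIRStatsEvaluator evaluator = new CustomizedRecommenderIRStatsEvaluator();
    IRStatistics statistics = evaluator.evaluate(
            recommenderBuilder, null, trainModel, testModel, null,
            2, GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, 1.0);
    System.out.format("The recommender precision is %f%n", statistics.getPrecision());
    System.out.format("The recommender recall is %f%n", statistics.getRecall());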

Rajkumar
  • I am sorry, there won't be a guide for such information. I downloaded all the source code of Mahout and debugged it line by line. In my opinion, that is the best way to understand the internals of any framework. I am sure we can find the answers to the above questions by debugging as well. – Rajkumar Oct 23 '14 at 12:00
  • Just as an example I took it as 10: 10 is the number of relevant items, i.e. the output of the getRelevantItemsIDs() method, which fetches all the relevant items based on the data model. And 5 is the number of recommendations given by the recommender; we only specify the maximum number of recommendations for the recommender to fetch. Moreover, the evaluator is not used in production. Its purpose is to compare different recommenders and choose the one that best fits our domain. Once we decide on the suitable recommender, we will use that in our application. – Rajkumar Oct 23 '14 at 12:35