5

Scnerio:

https://www.funtraker.com is listing movies, tv shows and games. On show page of each resource(Movie, Tv Show etc) we want to list down the related resources.

Schema:

class Movie < AR::Base
  has_many :resource_genres, as: :resource
  has_many :genres, through: :resource_genres
end

class ResourceGenre
  belongs_to :resource, polymorphic: true
end

Now I want to get a list of related movies based on matched genre( two movies are related if both has 'comedy` genre). And these related movies need to order by max number of matched genres.

Well here is sample movies and the expected output.

#Input
Movie         Genres
Movie 1:      horror, comedy, action, war
Movie 2:      action, thriller, crime, animation  
Movie 3:      comedy, war, action, thriller
Movie 4:      crime, animation, action, war

#Expected output
movie1.related_movies => [ movie3, movie2  ]
movie4.related_movies => [ movie2, remaining-three-movies-in-any-order ]
movie3.related_movies => [ movie1, movie2, movie4] 

Hopefully question make sense.

UPDATE: Looking for SQL only solution. I don't need to cache the results in any another table.

Community
  • 1
  • 1
Naveed
  • 11,057
  • 2
  • 44
  • 63
  • How many records will you be doing this with? Using ActiveRecord for a few hundred movies will work ok but this query will not scale. – eabraham Jan 20 '15 at 16:27
  • @eabraham actually Ill only need top 10 related records :) – Naveed Jan 20 '15 at 16:47
  • By top 10 do you mean first 10? – eabraham Jan 20 '15 at 18:15
  • possible duplicate http://stackoverflow.com/questions/8454478/algorithm-advice-calculating-related-stores-based-on-their-category-data – Saqib Jan 20 '15 at 19:06
  • @eabraham yeah top 10 means first 10 entries. Entries would be order by count of max matches so first ten will also be top 10 :) – Naveed Jan 21 '15 at 02:50

2 Answers2

4

You need to order by the group count of the movies ids after joined with resource_genres, take a look at the following pure SQL methods:

Method #1 (Single Query)

Double joining the resource_genres table on itself to maintain self genres ids:

def related_movies
   Movie.select("movies.*, COUNT(*) AS group_count").
   joins(:resource_genres).
   joins("JOIN resource_genres rg ON rg.genre_id = resource_genres.genre_id").
   where("rg.resource_type = 'Movie' 
          AND rg.resource_id = ? 
          AND movies.id != ?", self.id, self.id).
   group('movies.id').
   order('group_count DESC')
end

Method #2 (2 Queries)

Plucking the genre_ids from self resource_genres on a separate query.

def related_movies
   Movie.select("movies.*, COUNT(*) AS group_count").joins(:resource_genres).
   where("resource_genres.genre_id IN (?) 
         AND movies.id != ?", self.resource_genres.pluck(:genre_id), self.id).
   group('movies.id').
   order('group_count DESC')
end
Community
  • 1
  • 1
mohameddiaa27
  • 3,587
  • 1
  • 16
  • 23
  • 1
    I'm going with second answer for now. Will benchmark and post here with which query was more efficient in few days. Thanks for answer and congrats for bounty :) BTW I was missing: "movies.*, COUNT(*) AS group_count" – Naveed Jan 23 '15 at 05:44
  • I'm glad it helped . I usually prefer Method 1, try to keep your benchmarks on 1000+ records per table, and let us know. – mohameddiaa27 Jan 23 '15 at 10:30
0

If you are finding a solution in rails code, then it might solve your problem.

def related_movies
  scores_hash = {}
  Movie.joins(:resource_genres).where('resource_genres.genre_id' => resource_genres.pluck(&:genre_id)).where.not(id: self.id).distinct.find_each do |movie|
    scores_hash[movie] = (movie.resource_genres.pluck(:genre_id) & self.resource_genres.pluck(:genre_id)).count
  end
  Hash[scores_hash.sort_by { |movie, score| -score }].keys
end
Kashif Umair Liaqat
  • 1,064
  • 1
  • 18
  • 27