8

I have the following query returning duplicate titles, but :id is nil:

Movie.select(:title).group(:title).having("count(*) > 1")

[#<Movie:0x007f81f7111c20 id: nil, title: "Fargo">,
#<Movie:0x007f81f7111ab8 id: nil, title: "Children of Men">,
#<Movie:0x007f81f7111950 id: nil, title: "The Martian">,
#<Movie:0x007f81f71117e8 id: nil, title: "Gravity">]

I tried adding :id to the select and group but it returns an empty array. How can I return the whole movie record, not just the titles?

Ashbury
  • 2,160
  • 3
  • 27
  • 52

2 Answers2

15

A SQL-y Way

First, let's just solve the problem in SQL, so that the Rails-specific syntax doesn't trick us.

This SO question is a pretty clear parallel: Finding duplicate values in a SQL Table

The answer from KM (second from the top, non-checkmarked, at the moment) meets your criteria of returning all duplicated records along with their IDs. I've modified KM's SQL to match your table...

SELECT
  m.id, m.title
FROM 
  movies m
INNER JOIN (
  SELECT
    title, COUNT(*) AS CountOf
  FROM
    movies
  GROUP BY 
    title
  HAVING COUNT(*)>1
) dupes 
ON
  m.title=dupes.title

The portion inside the INNER JOIN ( ) is essentially what you've generated already. A grouped table of duplicated titles and counts. The trick is JOINing it to the unmodified movies table, which will exclude any movies that don't have matches in the query of dupes.

Why is this so hard to generate in Rails? The trickiest part is that, because we're JOINing movies to movies, we have to create table aliases (m and dupes in my query above).

Sadly, it Rails doesn't provide any clean ways of declaring these aliases. Some references:

Fortunately, since we've got the SQL in-hand, we can use the .find_by_sql method...

Movie.find_by_sql("SELECT m.id, m.title FROM movies m INNER JOIN (SELECT title, COUNT(*) FROM movies GROUP BY title HAVING COUNT(*)>1) dupes ON m.first=.first")

Because we're calling Movie.find_by_sql, ActiveRecord assumes our hand-written SQL can be bundled into Movie objects. It doesn't massage or generate anything, which lets us do our aliases.

This approach has its shortcomings. It returns an array and not an ActiveRecord Relation, which means it can't be chained with other scopes. And, in the documentation for the find_by_sql method, we get extra discouragement...

This should be a last resort because using, for example, MySQL specific terms will lock you to using that particular database engine or require you to change your call if you switch engines.

A Rails-y Way

Really, what is the SQL doing above? It's getting a list of names that appear more than once. Then, it's matching that list against the original table. So, let's just do that using Rails.

titles_with_multiple = Movie.group(:title).having("count(title) > 1").count.keys

Movie.where(title: titles_with_multiple)

We call .keys because the first query returns an hash. The keys are our titles. The where() method can take an array, and we've handed it an array of titles. Winner.

You could argue one line of Ruby is more elegant than two. And if that one line of Ruby has an ungodly string of SQL embedded within it, how elegant is it really?

Hope this helps!

Community
  • 1
  • 1
Lanny Bose
  • 1,811
  • 1
  • 11
  • 16
-2

You can try to add id in your select:

Movie.select([:id, :title]).group(:title).having("count(title) > 1")
akbarbin
  • 4,985
  • 1
  • 28
  • 31
  • 1
    I should have mentioned I tried that but I get "PG::GroupingError: ERROR: column "movies.id" must appear in the GROUP BY clause or be used in an aggregate function" – Ashbury Oct 27 '15 at 07:08
  • Perhaps you have to add your id in your group too. I have updated it. – akbarbin Oct 27 '15 at 07:10
  • 1
    That returns an empty array, I think because it's looking for duplicate :id as well. :( Thanks though. – Ashbury Oct 27 '15 at 07:13
  • Can you tell me what data you want to find? – akbarbin Oct 27 '15 at 07:26
  • 1
    In my example all the ids are nil. I want to find duplicate :titles but return the whole Movie record. – Ashbury Oct 27 '15 at 07:36
  • It happens because of `count(*)`. You can add specific counting by title. Let's see my updated code. – akbarbin Oct 27 '15 at 07:41
  • 1
    Sorry man, just doesn't work. I tried with count(title). Adding :id to select gives PG::GroupingError, adding :id to both gives empty array. – Ashbury Oct 27 '15 at 07:48
  • This won't return anything because column id is not part of the group clause – damuz91 May 23 '22 at 14:42