10

I have a Rails 5 app. I have a table filled with URL data that is pulled in from various sources:

id     url
1      http://google.com
2      http://yahoo.com
3      http://msn.com
4      http://google.com
5      http://yahoo.com
6      http://askjeeves.com

How can I remove the duplicates from this table?

user3063045
  • 2,019
  • 2
  • 22
  • 35

5 Answers5

23

SQL solution without loops:

Model.where.not(id: Model.group(:url).select("min(id)")).destroy_all

OR

Model.where.not(id: Model.group(:url).select("min(id)")).delete_all

OR

dup_ids = Model.group(:url).select("min(id)").collect{|m| m['min(id)']}
Model.where.not(id: dup_ids).delete_all
#Model.where.not(id: dup_ids).destroy_all 

This will delete all duplicates keeping records with minimum id for duplicate records.

bragboy
  • 34,892
  • 30
  • 114
  • 171
dnsh
  • 3,516
  • 2
  • 22
  • 47
  • Great and simple SQL solution. It took me a long time to find something that works like this. Many other solutions included something like `Model.select(:url).group(:url).having('count("url") > 1')` which always resulted in a `GroupingError: ERROR: column must appear in the GROUP BY clause or be used in an aggregate function`. – mabu Jun 18 '18 at 12:23
  • The best solution should be `Model.where.not(id: Model.group(:url).select("min(id)")).delete_all ` Unfortunately, this throws a "You can't specify target table 'models' for update in FROM clause". The generated SQL is something like this: ```SQL DELETE FROM `models` WHERE `models`.`id` NOT IN (SELECT min(id) FROM `models` GROUP BY `models`.`url`) ``` – Christian Butzke Oct 31 '19 at 09:45
  • (I am using MySQL 5.7) – Christian Butzke Oct 31 '19 at 09:56
6

You can group by url, leave one and delete duplicates:

Model.all.group(:url).values.each do |dup|
  dup.pop #leave one
  dup.each(&:destroy) #destroy other
end
bragboy
  • 34,892
  • 30
  • 114
  • 171
idej
  • 5,034
  • 1
  • 11
  • 14
5

Get array of good ids and then delete all records not in that list.

good_ids = Model.group(:url).pluck("max(id)")
Model.where.not(id: good_ids).delete_all
konyak
  • 10,818
  • 4
  • 59
  • 65
0

// Find all duplicate records and group them by a field

dups = MyModel.group(:url).having('count("url") > 1').count(:name)

// Iterate on each grouped item to destroy duplicate

dups.each do |key, value|

  # Keep one and return rest of the duplicate records

  duplicates = MyModel.where(url: key)[1..value-1]
  puts "#{key} = #{duplicates.count}"
  duplicates.each(&:destroy)

end
Emu
  • 5,763
  • 3
  • 31
  • 51
0

This also seems to be a solution.

I tried to convert it to Ruby, but it got quite complex (since I had more fields to group by), so I ended up just using plain SQL

DELETE t1 FROM 
urls t1
INNER JOIN (
    SELECT MAX(id) AS id, url FROM urls 
    GROUP BY url 
    HAVING COUNT(*) > 1
) t2 on t1.url = t2.url and t1.id != t2.id;

Hope that helps

Christian Butzke
  • 1,080
  • 12
  • 8