How can I delete row from a postgreSQL table where there are repeats on a specific column?

Question

I have a table where I need to delete any rows that have the same url and parent_url. The table has the following columns:

id, time, url, parent_url, source

with duplicates as follows:

1,12-24-2014,asdf.com, http://repeat.com, email
1,12-24-2014,zxzy.com, http://secondrepeat.com, email
2,12-26-2014,asdf.com, http://repeat.com, email
3,12-26-2014,zxzy.com, http://secondrepeat.com, email
4,12-27-2014,asdf.com, http://repeat.com, email

These would need to be reduced to the following 2 rows (having deleted 3 rows):

1,12-24-2014,asdf.com, http://repeat.com, email
1,12-24-2014,zxzy.com, http://secondrepeat.com, email

I have beeing trying to use a for-loop and have been looking for queries that follow this format, and have not found anything that has shed much light on how this count be done.

For example, I tried doing

DELETE FROM scraper WHERE (url, parent_url) IN (SELECT(url, parent_url) FROM scraper GROUP BY url, parent_url having count(*) > 1);

http://stackoverflow.com/questions/22181875/how-to-remove-duplicates-in-a-table — , Oct 13 '14 at 19:31
And of course this: http://stackoverflow.com/search?q=[postgresql]+delete+duplicates — , Oct 13 '14 at 19:31
These examples you've given are significantly less complicated, no? For instance, they give no information on deleting based on `multiple` matches. — maudulus, Oct 13 '14 at 19:56

score 3 · Accepted Answer · answered Oct 13 '14 at 20:03

3

You can use not exists, for instance:

delete from scraper
     where exists (select 1
                   from scraper s2
                   where s2.url = scraper.url and s2.prent_url = s.parent_url and
                         s2.time < s.time
                  );

Or, you can adapt the referenced solutions for something like this:

delete from scraper
     where id not in (select min(id)
                      from scraper s2
                      group by s2.url, s2.prent_url 
                     );

It is unclear from the question whether time or id provides a better choice for the min().

answered Oct 13 '14 at 20:03

Gordon Linoff

1,242,037
58
646
786

`DELETE FROM scraper WHERE id NOT IN (SELECT min(id) FROM scraper s2 GROUP BY s2.url, s2.parent_url);` – maudulus Oct 13 '14 at 20:23
Note `min(id)` is not guaranteed to be unique. The second query is not necessarily right. – wildplasser Oct 13 '14 at 22:55

How can I delete row from a postgreSQL table where there are repeats on a specific column?

1 Answers1