0

Table holds the images of the products. Each row points to URL of the product's image. Many products have multiple images.

url > Image url
product_id > Product's ID

Some products have duplicate images. I need to leave only one from the duplicates and remove other duplicate URLs for that product.

I cannot group URLs and remove duplicates because there might be another row that has the same URL and have different product_id.

TABLE
-
id | product_id | url | is_primary

enter image description here

Kattarina
  • 79
  • 1
  • 2
  • 11
buraktokman
  • 35
  • 1
  • 8
  • Could you show sample data (a few rows) and expected results ? I can't understand your requirement `I want to remove duplicates with the same URL ... but I cannot remove duplicates because there might be another row that has the same URL and have different product_id` - do you mean "remove duplicates only for the same product id" ? – krokodilko Jul 12 '19 at 09:27
  • image uploaded as example – buraktokman Jul 12 '19 at 09:44
  • 1
    @sirdavalos . . . Your updated data has no duplicate products. – Gordon Linoff Jul 12 '19 at 11:00

2 Answers2

1

Depending on how you want to de-duplicate the records, there are a few different ways to accomplish this. The most straightforward is:

SELECT
    DISTINCT ON (product_id)
          id
        , product_id
        , url
        , is_primary
FROM table

From the documentation:

SELECT DISTINCT eliminates duplicate rows from the result. SELECT DISTINCT ON eliminates rows that match on all the specified expressions. SELECT ALL (the default) will return all candidate rows, including duplicates. (See DISTINCT Clause below.)

It sounds like you'll probably want to use DISTINCT ON instead of DISTINCT; the difference being DISTINCT ON allows you to return columns not part of the distinct component.


If you need to do something more complex (e.g. find the smallest id associated with the product_id), you can add sorting to the query which will make the result deterministic by adding an ORDER BY clause.

The thing to note about ORDER BY clauses used in conjunction with DISTINCT ON is that all of the columns in DISTINCT ON must come first in the ORDER BY. So it becomes possible to find the smallest id associated with each product with something like:

SELECT
    DISTINCT ON (product_id)
          id
        , product_id
        , url
        , is_primary
FROM table
ORDER BY
      product_id
    , id ASC 

Depending on how this table is set up, there may be an easier answer. If is_primary is a boolean column with a partial unique index (see this post and this documentation for more details), you'd be able to get a single id per product_id with a query like:

SELECT
      id
    , product_id
    , url
    , is_primary
FROM table
WHERE is_primary

I mention this because it's a really nice way to filter to a single unique value if your database is set up this way. Depending on your table size, it may also be more performant because Postgres will have an index on disk and won't need to sort through all of the possible records.


If your goal is to actually delete the duplicate records from the table, one way to do it would be to form a query (e.g. one of the above) with the valid records, and then use it as the USING clause for a DELETE statement with a WHERE NOT EXISTS clause.

Adam Bethke
  • 1,028
  • 2
  • 19
  • 35
0

You can use EXISTS to delete the duplicates:

delete from tablename t
where exists (
  select 1 from tablename
  where product_id = t.product_id and url = t.url and id < t.id
)

This will only one of the duplicates urls for each product_id, the one with the smallest id.

forpas
  • 160,666
  • 10
  • 38
  • 76