0

How can I write an SQL (MYSQL Database) statement that will return a result set which is unique in conjunction with two columns?

To make my question as simple as possible, let's say I have only two tables images and products. I want to query all the images of a certain set of products that have the string Swatch in the the field, images. This query succeeds with the following SQL using a MYSQL database:

SELECT 
    images.alt,
    images.product_id,
    images.src
FROM
    images
    INNER JOIN products
        ON products.product_id IN (
            "2112055640177",
            "2112056590449",
            "2112055378033",
            "2112062292081",
            "2112058490993",
            "2112062619761",
            "2112062488689",
            "2112066420849",
            "2112061833329",
            "2112052527217"
        )
WHERE 
    images.alt LIKE "%Swatch%";

However the result set is full of duplicates:

Black Tuscan - Swatch   2112049971313   foobar.com
Black Tuscan - Swatch   2112049971313   foobar.com
Black Tuscan - Swatch   2112049971313   foobar.com
Generic Black - Swatch  2112049971313   baz.com
Generic Black - Swatch  2112049971313   baz.com
Florence - Swatch   2112049971313   foobaz.com
Florence - Swatch   2112049971313   foobaz.com
Gold - Swatch   2112050593905   bazfoo.com
Gold - Swatch   2112050593905   bazfoo.com
Sand - Swatch   2112050593905   bazfoo.com
Sand - Swatch   2112050593905   bazfoo.com

Every product id set of images has duplicates (example: Black Tuscan - Swatch 2112049971313 foobar.com, Black Tuscan - Swatch 2112049971313 foobar.com)

I would like the result set to not contain duplicate product swatches . In other words, images should be unique by images.alt and images.product_id. Ideally instead of the above return value the result set should the following:

Black Tuscan - Swatch   2112049971313   foobar.com
Generic Black - Swatch  2112049971313   baz.com
Florence - Swatch   2112049971313   foobaz.com
Gold - Swatch   2112050593905   bazfoo.com
Sand - Swatch   2112050593905   bazfoo.com
Black Tuscan - Swatch   1234586923813   foobar.com
Sand - Swatch   1234586923813   bazfoo.com

Notice how for each product id there are no duplicate alt text values.

I have done a lot of searching and I have found these SO posts (one, two, and three). They have suggested combinations of SQL that use either a LEFT JOIN, or GROUP BY, or DISTINCT but I have not been able to find the correct SQL statement that gets me the uniqueness I am looking for - unique by images.alt and images.product_id.

GMB
  • 216,147
  • 25
  • 84
  • 135
robskrob
  • 2,720
  • 4
  • 31
  • 58

2 Answers2

3

You don't need a JOIN to products at all. That might fix your problem.

SELECT i.alt, i.product_id, i.src
FROM images i
WHERE i.product_id IN ('2112055640177', '2112056590449', '2112055378033', '2112062292081', '2112058490993', '2112062619761', '2112062488689', '2112066420849', '2112061833329', '2112052527217')
WHERE i.alt LIKE '%Swatch%';

If that doesn't fix your problem, use GROUP BY:

SELECT i.alt, i.product_id, MAX(i.src)
FROM images i
WHERE i.product_id IN ('2112055640177', '2112056590449', '2112055378033', '2112062292081', '2112058490993', '2112062619761', '2112062488689', '2112066420849', '2112061833329', '2112052527217')
WHERE i.alt LIKE '%Swatch%'
GROUP BY i.alt, i.product_id;

If you actually do want additional columns from products, you need to do the JOIN correctly:

SELECT . . .
FROM images i JOIN
     products p
     USING (product_id)
WHERE p.product_id IN ('2112055640177', '2112056590449', '2112055378033', '2112062292081', '2112058490993', '2112062619761', '2112062488689', '2112066420849', '2112061833329', '2112052527217')
WHERE i.alt LIKE '%Swatch%';
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
3

Your join is messed up, because it does nothing to relate the tables together. I think you need something more like this:

SELECT images.alt, images.product_id, images.src
FROM images
INNER JOIN products
ON products.product_id = images.product_id
WHERE products.product_id IN ("2112055640177","2112056590449","2112055378033","2112062292081","2112058490993","2112062619761","2112062488689","2112066420849","2112061833329","2112052527217") and
images.alt LIKE "%Swatch%";

As to why the duplicates appeared, this is called a Cartesian product. When the join conditions are broken, rows that are not related to each other end up joined.

If you had 2 products and 3 images, and you did:

products JOIN images ON 1=1

The join condition is always true, each image will duplicate 2 times, each product appears 3 times

P1,i1
P1,i2
P1,i3
P2,i1
P2,i2
P2,i3

It may help you to conceive that this is how databases join data: they combine every row from this table with every row from that table, then remove combined rows that do not meet the criteria in the join condition

In your original query, if there were 3 8mages with “swatch” and 10 products in you IN, you get 30 rows, as each of the 10 product rows is combined with each of the 3 images. If you had 100 products and 100 images, conceive that the db joined all combinations, generating 10000 rows, then removed rows not complying with the IN (90/100 products are not allowed, 90 percent of rows are removed, removed 9000 rows, leaving 1000), then removed all rows not complying with the WHERE (97/100 images don’t have swatch, 97 percent of rows removed, 970 rows removed, leaving 30 rows)

Because you didn't select all columns (SELECT *) you don’t see the variations that make each row unique.. put it in and you’ll see

Do not use distinct to remove duplicates, fix the broken join condition that causes the Cartesian product

Caius Jard
  • 72,509
  • 5
  • 49
  • 80
  • You're correct. The main problem is the OP seems to be confusing what should go in the "join" clause (link tables) vs. what should go in "where" (filter criteria). See also [this post](https://stackoverflow.com/questions/54066099/). Moreover - since everything he's selecting is from "images" - he probably doesn't need a "join" at all. As Gordon Linoff points out, a simple "select" should be sufficient. – paulsm4 Jan 07 '19 at 17:35
  • True.. I often think with those kinda things tho that the OP might have started out with Products and Images because they knew they needed something out of Products, but it becomes a secondary requirement to the other goal of getting a count(*) working etc, and then come the end of it, they'll try and add the products part back in and get unstuck.. Always a bit of a tricky one to call - go simple and let them ask another question, or assume and potentially end up with another DB out there in the world doing more work than necessary for the rest of its life :) – Caius Jard Jan 07 '19 at 18:07