0

I have some scraped product data in a database which I'd like to use on my website. I want to write a query that gives back all items where title LIKE "%EXAMPLE%" but then only with unique products.

The issue is that I have multiple rows for 1 item and I only want 1 row per product to be returned (I scrape daily so every day every item get's an extra row). The only difference between the rows is that they have another date and price because thats what I scrape for, price history.

Example: We have 3 items: Pink chocolate, Pink apple and Pink pear. Each item has 3 rows because I scraped them 3 times. so for example (for the purpose of this example I did not add all the other columns):

productId title price isAvailable
ABC123DEF Pink Apple 0.47 1
ABC123DEF Pink Apple 0.42 1
ABC123DEF Pink Apple 0.41 1
ABC333FHG Pink Pear 0.41 1
ABC333FHG Pink Pear 0.41 1
ABC333FHG Pink Pear 0.41 1
FH5845FJG Pink Chocolate 0.41 1
FH5845FJG Pink Chocolate 0.41 1
FH5845FJG Pink Chocolate 0.41 1

The result I want to get is:

productId title price isAvailable
ABC123DEF Pink Apple 0.47 1
ABC333FHG Pink Pear 0.41 1
FH5845FJG Pink Chocolate 0.41 1

It looks like I have to search on title and then filter out duplicate productId's so that I'm left with the right result. I do not know how to do this though.

any thoughts?

sneaker
  • 49
  • 2
  • 8
  • You haven't told us how the output `price` column should be computed - or which row of a set of partial-duplicates should be picked (why isn't there a `scrapedAtUtc` column with the date+time?) – Dai Jul 05 '22 at 07:45
  • You must enumerate rows in a group (for example, a group with the same productId) specifying the ordering which sets the number `1` for a row which must be stored, then delete the rows which have this number above 1. The sorting must provide the rows uniqueness for the deletion to be deterministic. For example, this may be `ORDER BY price DESC`. – Akina Jul 05 '22 at 07:48
  • @Dai I left out all the columns that are not neccessary for this post. What do you mean with how the output price column should be computed? – sneaker Jul 05 '22 at 07:51
  • Does this answer your question? [SQL select only rows with max value on a column](https://stackoverflow.com/questions/7745609/sql-select-only-rows-with-max-value-on-a-column) – Stu Jul 05 '22 at 07:52
  • @stu It looks like it, but not exactly. He needs 1 row per ID based on the highest rev and I need 1 row per title based on product ID ( I cant say highest productId but I really have to exclude all rows with a certain productId after 1 is added to the results) – sneaker Jul 05 '22 at 07:56
  • @Akina Could you rephrase what you said in a simpler way? I don't seem to fully understand what u mean (do note i start off with searching on title, not productid) – sneaker Jul 05 '22 at 08:02
  • 1
    @sneaker you haven't explained how to choose one row of many - you are missing somethig like a date or sequence column, unless you want the row with the highest price. – Stu Jul 05 '22 at 08:06
  • @stu That's basically the question. I do know how to get all products with a title containing what I search for, but the list it returns contains duplicates ( because I scrape the products multiple times, it has multiple saved rows). I need to filter those duplicates out based on their productId so the list I end up with would have each product that contains what I searched for ONCE . I do also have an auto_increment ID and a date column but I don't see how they could be useful here. – sneaker Jul 05 '22 at 08:09

1 Answers1

0

An example:

WITH
cte AS ( 
    SELECT *, ROW_NUMBER() OVER (PARTITION BY productId ORDER BY price DESC) rn
    FROM test
)
DELETE test
FROM test
NATURAL JOIN cte
WHERE cte.rn > 1;

The query saves the rows with maximal price for each productId and removes another rows for this product.

https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=40df8e8e4b3eb206e0f73b7ce3a70aa5

Pay attention - each complete row which must be stored is unique (the rows which must be removed may have complete duplicates).


I just need a query that filters the data but does not delete it. – sneaker

WITH
cte AS ( 
    SELECT *, ROW_NUMBER() OVER (PARTITION BY productId ORDER BY price DESC) rn
    FROM test
)
SELECT *
FROM cte
WHERE rn = 1;

This query does not need the row to be unique, in this case only one row copy is returned. If you need all copies then use RANK() or DENSE_RANK() instead of ROW_NUMBER().


The solution for MySQL version 5.x.

SELECT *
FROM test
WHERE NOT EXISTS ( 
    SELECT NULL
    FROM test t
    WHERE test.productId = t.productId
      AND test.price < t.price
);

This query will return all copies if present. If you need only one copy then add DISTINCT.

Akina
  • 39,301
  • 5
  • 14
  • 25
  • This looks like what I need but if I'm not mistaken you are deleting every row that is not unique. I do have to keep these rows because it's important price history ( I'm making an inflation panel) I just need a query that filters the data but does not delete it. – sneaker Jul 05 '22 at 08:14
  • Thank you. When I try the query it says: Unrecognized statement type. (near "WITH" at position 0) – sneaker Jul 05 '22 at 08:20
  • @sneaker Maybe you use too old MySQL version which does not support CTE? 8.0 needed... if so then use rows enumeration based on user-defined variables, or use NOT EXISTS correlated subquery. – Akina Jul 05 '22 at 08:22
  • this is still french for me. I'll do some more in depth reading because I clearly see I have to get a better general understanding. Thank you anyways :) – sneaker Jul 05 '22 at 08:24
  • @sneaker variant for 5.x added. – Akina Jul 05 '22 at 08:27