1

I have this problem where I need to remove duplicates from my data base, but the rows aren't identical.

For example, lets say this is my scheme: (id,name,revirew,uid)

After uploading the data to the database, I have the following rows:

123,Dan,"very good",1000
123,Dan,"very good",2000

I want to keep only the first row, because it's the same review but was recorded with different uid, how can I achieve that? (meaning, it's not exactly a duplicate, but in my database it counts as one).

note: my original problem is that I downloaded a reviews dateset that was crawled. The problem is that there are identical reviews but with different ids generated, so I need to remove them. for simplicity purposes I gave this simple example.

EDIT: I want to keep the row with lower uid.

I've searched and only found ways for removing duplicate rows, but in this case they are not really identical thus not considered duplicates and won't be removed.

[SOLVED]: SELECT id,name,revirew, MIN(uid) FROM ... GROUP BY id,name,revirew; solved the problem, solution from the comments, by @Atom

  • 1
    What is the rule for choosing which time you want to keep? `night > evening > noon > morning` ? – Atmo Dec 29 '22 at 12:21
  • 1
    `group by` every identical value and `max/min` not identical - this is simplest way. Then there is `row_number - partition by... where num = 1` – T.S. Dec 29 '22 at 12:26
  • 2
    *I want to keep only the first row* The term "first" (and next/previous/last/etc.) not exists until you specify the sorting criteria which provides rows uniqueness. This can be, for example, row creation timestamp, or a priority column value taken from another table joined by `time` column ... – Akina Dec 29 '22 at 12:31
  • I've edited the question so that it will be more precise and more relavant to my problem. – james2471993 Dec 29 '22 at 12:48
  • "I want to keep only the first row, because it's .... recorded on different time" I do not see a field **time** in your database ? – Luuk Dec 29 '22 at 12:53
  • 1
    `SELECT id,name,revirew, MIN(uid) FROM ... GROUP BY id,name,revirew`; you insert the result of it in another (temp) table, truncate your table then push the records you saved back into it and drop the temp table – Atmo Dec 29 '22 at 12:55
  • @Luuk, there was another column in the original version of the question instead of uuid. – Atmo Dec 29 '22 at 12:55
  • @james2471993 I kind of disagree with the answers provided in the question that was used to mark this one as a duplicate, IMO, it is important to truncate the orginal table to reset its high watermark, which appears to be mentioned nowhere in the answers... But I do agree **you need to work on a copy first** or at least check the temp table contains what you expect before executing the truncate. – Atmo Dec 29 '22 at 13:03
  • @Atmo Thanks, I agree. I will first look at the copy and temporary changes before committing. My explanation in the question might be lacking of information, but in my case I'm sure that it is ok to regard the two as "duplicates" so this solution does work in my personal case. – james2471993 Dec 29 '22 at 13:19

0 Answers0