3

Let's say i have a user table like this :

+----+-----------+----------------------+------+
| ID | Name      | Email                | Age  |
+----+-----------+----------------------+------+
| 1  | John      | john.doe1@mail.com   | 24   |
| 2  | Josh      | josh99@mail.com      | 29   |
| 3  | Joseph    | joseph410@mail.com   | 21   |
| 4  | George    | gge.48@mail.com      | 28   |
| 5  | Joseph    | jh.city89@mail.com   | 24   |
| 6  | Kim       | kimsd@mail.com       | 32   |
| 7  | Bob       | bob.s@mail.com       | 38   |
| 8  | Joseph    | psa.jos@mail.com     | 34   |
| 9  | Joseph    | joseph.la@mail.com   | 28   |
| 10 | Jonathan  | jonhan@mail.com      | 22   |
+----+-----------+---------+------------+------+

In the actual, the database consists of more data and some of the data is duplicated, with more than two records. But the point is i want to get only the first and the second row of the duplicated rows that contains the name of "Joseph", How can i achieve this ? My code so far...

User::withTrashed()->groupBy('name')->havingRaw('count("name") >= 1')->get();

With that code the result will retrieve :

+----+-----------+----------------------+------+
| ID | Name      | Email                | Age  |
+----+-----------+----------------------+------+
| 1  | John      | john.doe1@mail.com   | 24   |
| 2  | Josh      | josh99@mail.com      | 29   |
| 3  | Joseph    | joseph410@mail.com   | 21   |
| 4  | George    | gge.48@mail.com      | 28   |
| 6  | Kim       | kimsd@mail.com       | 32   |
| 7  | Bob       | bob.s@mail.com       | 38   |
| 10 | Jonathan  | jonhan@mail.com      | 22   |
+----+-----------+---------+------------+------+

And i use this code to try to get the second duplicated row :

User::withTrashed()->groupBy('name')->havingRaw('count("name") >= 2')->get();

The result is still same as the mentioned above :

+----+-----------+----------------------+------+
| ID | Name      | Email                | Age  |
+----+-----------+----------------------+------+
| 1  | John      | john.doe1@mail.com   | 24   |
| 2  | Josh      | josh99@mail.com      | 29   |
| 3  | Joseph    | joseph410@mail.com   | 21   |
| 4  | George    | gge.48@mail.com      | 28   |
| 6  | Kim       | kimsd@mail.com       | 32   |
| 7  | Bob       | bob.s@mail.com       | 38   |
| 10 | Jonathan  | jonhan@mail.com      | 22   |
+----+-----------+---------+------------+------+

I want the result is to get record that have the id "5" with name "Joseph" like this :

    +----+-----------+----------------------+------+
    | ID | Name      | Email                | Age  |
    +----+-----------+----------------------+------+
    | 1  | John      | john.doe1@mail.com   | 24   |
    | 2  | Josh      | josh99@mail.com      | 29   |
    | 4  | George    | gge.48@mail.com      | 28   |
    | 5  | Joseph    | jh.city89@mail.com   | 24   |
    | 6  | Kim       | kimsd@mail.com       | 32   |
    | 7  | Bob       | bob.s@mail.com       | 38   |
    | 10 | Jonathan  | jonhan@mail.com      | 22   |
    +----+-----------+---------+------------+------+

But it seems only the first duplicate row is retrieved and i can't get the second duplicated row, can anybody give me suggestion ?

utuyutuf
  • 33
  • 8

1 Answers1

4

Let's start from your query

User::withTrashed()->groupBy('name')->havingRaw('count("name") >= 1')->get();

This will show all groups of rows whose count equals to 1 ore more. and this is the description of DISTINCT.

If you want to get only duplicate records you should get groups whose count is LARGER than 1.

The other thing to notice here is that a non-aggrigated column will be chosen randomly. because when you get a name and it's count, for example if you select name,count(name), email (email is not in the group by clause - not aggregated), and 4 rows have the same name. so you'll see:

+--------+-------------+-------+
| Name   | Count(Name) | Email |
+--------+-------------+-------+
| Joseph | 4           | X     |
+--------+-------------+-------+

what do you expect instead of X? which one of the 4 emails? actually, in SQLServer it's forbidden to select a non-aggrigated column and other databases will just give you a random one of the counted 3. see this answer for more details it's explained very well: Do all columns in a SELECT list have to appear in a GROUP BY clause

So, we'll use having count(name) > 1 and select only the aggregated column name

DB::from('users')->select('name')->groupBy('name')->havingRaw('count("name") > 1')->get();

This should give you (didn't test it) this:

+--------+-------------+
| name   | Count(name) |
+--------+-------------+
| Joseph | 4           |
+--------+-------------+

This will give you all names who have 2 or more instances. you can determine the number of duplicates in the having clause. for example having count(name) = 3 will give you all names which have exactly 3 duplicates.

So how to get the second duplicate? I have a question for that:

What is the first (original) duplicate? is it the one with the oldest created_at or the oldest updated_at ? or maybe some other condition?. because of that you should make another query with order by clause to give you the duplicates in the order most convenient to you. for example:

select * from `users` where `name` in  (select `name` from users group by `name` having count(`name`) > 1) order by `id` asc

which will give:

+----+-----------+----------------------+------+
| ID | Name      | Email                | Age  |
+----+-----------+----------------------+------+
| 3  | Joseph    | joseph410@mail.com   | 21   |
| 5  | Joseph    | jh.city89@mail.com   | 24   |
| 8  | Joseph    | psa.jos@mail.com     | 34   |
| 9  | Joseph    | joseph.la@mail.com   | 28   |
+----+-----------+---------+------------+------+
Majed DH
  • 1,251
  • 18
  • 35