2

I see a lot of ways to find duplicate records and delete them but I could not find the one I am looking for i.e. to delete them based on certain condition.

The first SQL code below taken from another post works very well in locating duplicates. However, I am having problem withe the second part.

<cfquery datasource="mydatabase" name="duplist">
SELECT  a.*, b.totalCount AS Duplicate
FROM    mytable a
        INNER JOIN
        (
            SELECT  email, COUNT(*) totalCount
            FROM    mytable
            GROUP   BY email
            HAVING  COUNT(*) > 1
        ) b ON a.email = b.email
</cfquery>

<cfoutput query="duplist">
    <CFQUERY DATASOURCE="mydatabase" name="dlist">
    SELECT * FROM mytable
    WHERE userid = '#userid#'
    AND activedate is null
    </CFQUERY>
</cfoutput> 

<cfoutput>
Total To Be Deleted: #dlist.recordcount#
</cfoutput>

I am trying use SELECT (before replace it with DELETE) only those duplicates user emails that do not have an "activedate" based on the output query "duplist" with duplicated email records. I am only getting one record (in #dlist.recordcount#) which is the last record! Isn't the query supposed to loop through them all?

EDITED:

After some feedback and checking the recordcount wasn't counting correctly because it is outside the loop. It will work in retrieving those records without startdate. However, now the real problem is in the following scenario:

ID   USERID   EMAIL          STARTDATE
1    user1    test@test.com 
2    user2    test@test.com  11/01/2014
3    user3    test@test.com 
4    user4    test@test.com  11/02/2014 
5    user5    me@mydomain.com
6    user6    me@mydomain.com

Running through the above code will list these 6 records of duplicate emails. The second query will select(delete) those without startdate i.e. Record 1, 3, 5 and 6. Two problems:

  1. Record 1 and 3 will be deleted, but 2 and 4 are still duplicates
  2. Record 5 and 6 will be deleted and I am left with no user with the email. I need to keep one of them.

How can the codes be modified to deal with the scenarios? I want to keep record #4 and #6 because 4 is newer than 2 and 6 is newer than 6.

4    user4    test@test.com  11/02/2014 
6    user6    me@mydomain.com
Dharman
  • 30,962
  • 25
  • 85
  • 135
Jack
  • 853
  • 1
  • 7
  • 20
  • This is the correct behaviour of `cfoutput`. If you want to see all the records, one by one; move the last statement `Total To Be Deleted: #dlist.recordcount# ` into `....`. You will see the expected output `n` number of times. – Gaurav S Nov 03 '14 at 05:41
  • It may loop thru it there, but there is a condition for the second query. Putting it there isn't going to do a right count of what it suppose to filter in the second query. – Jack Nov 03 '14 at 05:55
  • If you want the right count, you would need to use a variable inside the `` and increment it if `#dlist.recordcount#` is greater than 0. – Gaurav S Nov 03 '14 at 06:27
  • The count maybe wrong but what I really want to achieve is to delete the right duplicate record which has activedate null but the code is not doing it... – Jack Nov 03 '14 at 06:39
  • What is the expected output? As per the edited data sample, which records you want to retain after the deletion. – Gaurav S Nov 03 '14 at 23:00
  • What is the basis of selecting between record #5 and #6? – Gaurav S Nov 04 '14 at 01:05
  • Actually, it doesn't matter but just record #6 is newer. – Jack Nov 04 '14 at 01:31

4 Answers4

1


You can use IN clause in your query. But first you need to create a list of userId's that you are going to pass into your second query. You can do that using valueList like this

<cfset userIdList = valueList(duplist.userId) />

and then pass that list in your query like this

<CFQUERY DATASOURCE="mydatabase" name="dlist">
    SELECT * FROM myTable WHERE userId IN (#userIdList#) AND activeDate is null
<CFQUERY>
Tushar Bhaware
  • 2,525
  • 1
  • 16
  • 29
  • Code not working....Error Executing Database Query. You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use... I am also very wary of using IN. There were other solutions given for finding duplicates using nested SQL using IN. When I ran it, it was so inefficient that it crashed the table every time! – Jack Nov 03 '14 at 05:14
  • @Jack, This code should work. Have you wrapped CFQUERY with CFOUTPUT like this the above query. And don't worry about IN Clause, IN clause is sufficient, there might be problem in your nesting SQL logic. – Tushar Bhaware Nov 03 '14 at 05:24
  • I did. It is throwing error because some user ids has @ sign. I don't know why it's doing that. – Jack Nov 03 '14 at 05:34
  • 1
    @Jack, I guess you table has **userids as varchar**, not as int. If that is the case then use, **QuotedValueList** instead of **valueList** in the above code. – Tushar Bhaware Nov 03 '14 at 05:44
  • Error diagnostics shows: Error Executing Database Query. You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'laizhijie'',''baimaolai1'',''baimaolai2'',''fsabeijing'',''pekfsa'',''19900917''' at line 1 The error occurred on line 20. Line 20 is your SQL code: SELECT * FROM myTable WHERE userId IN (#userIdList#) AND activeDate is null – Jack Nov 03 '14 at 05:59
  • @Jack, From the looks of it, there is problem with your db not with the query. You need to sanitize your db. User ids should be integer (recommended approach). You need to dump userIdList using cfdump and check whether any userId's contain single/double quote. Remove those manually and then run above query. – Tushar Bhaware Nov 03 '14 at 06:09
  • I just checked. There isn't anyone with ID that has single or double quote. It was not allowed. – Jack Nov 03 '14 at 06:23
  • @Jack, Sorry, I should have also asked to check for #. If that doesn't work then you need to check with one of your senior. I can help as much as this. I hope it will work. – Tushar Bhaware Nov 03 '14 at 07:11
  • The logic of this answer is fine. The error message posted by Jack shows either double quotes or two single quotes around each name. That's likely the problem. I suggest outputting the list first to look at your data. I also suggest using query parameters even if it's not a user input. – Dan Bracuk Nov 03 '14 at 10:36
  • There is no # either. Dan, the quotes resulted from the QuotedValueList function. Looks like the original code was working. I have updated the question. Thank you. – Jack Nov 03 '14 at 20:16
1

This can be done in single query also

<CFQUERY DATASOURCE="mydatabase" name="dlist">
 SELECT * FROM myTable 
 WHERE userId IN (SELECT userId
    FROM    mytable 
    GROUP   BY email
    HAVING  COUNT(email) > 1
   ) 
   AND activeDate is null
<CFQUERY>
Keshav jha
  • 1,356
  • 8
  • 22
1

This is your original query:

SELECT  a.*, b.totalCount AS Duplicate
FROM    mytable a
    INNER JOIN
    (
        SELECT  email, COUNT(*) totalCount
        FROM    mytable
        GROUP   BY email
        HAVING  COUNT(*) > 1
    ) b ON a.email = b.email

All you need to do is add:

where a.activeDate is null

and you have your records. Then once you have seen them, replace your select clause with the word "delete"

Edit Starts Here

Since this will still leave duplicates, after you do the above, do something like this:

select id, email -- can be changed to delete after
from myTable deleteFrom join myTable keep on deleteFrom.email = keep.email
where deleteFrom.id < keep.id

That will keep the newest record.

Dan Bracuk
  • 20,699
  • 4
  • 26
  • 43
  • This works! Thank you Dan. But there is an underlying problem with my selection of records. I have updated my question. Please take a quick look. – Jack Nov 03 '14 at 20:26
  • This I believe may take care of problem 1 if I delete duplicate using the first query. But the problem is the first query will delete the one record I want to keep even if activedate is null. So I will still have problem 2. – Jack Nov 04 '14 at 04:39
  • RE Edit: `where deleteFrom.id < keep.id` That logic will should work fine for a one off. (Assuming the latest start date is linked to the latest "id"). – Leigh Nov 04 '14 at 18:29
1

You can use the Rank() function to rank the data and delete the rows which are not ranked 1.

Oracle equivalent of the Rank() query to select the #4 and #6 data:

SELECT * FROM
(
SELECT USERNAME, EMAIL, START_DATE, 
        RANK() OVER (PARTITION BY EMAIL ORDER BY START_DATE DESC NULLS LAST, ROWNUM DESC) AS RANK
FROM TEMP_SOL
)
WHERE RANK = 1

Mysql equivalent can be derived using this link Rank function in MySQL with Order By clause

Edit : A non-rank solution can be something like:

First get all the duplicate emails

<cfquery name="dupEmail" datasource="XXX">
SELECT EMAIL
FROM TEMP_SOL
GROUP BY EMAIL
HAVING COUNT(*) > 1
</cfquery>

Loop through dupEmail and

  1. select the username which doesn't have to be deleted.
  2. Select/Delete the usernames not equal to the above username having the same emailid.

    <cfloop query="dupEmail">
       <cfquery name="UserToRetain" datasource="XXX">
           SELECT USERNAME FROM TEMP_SOL
           WHERE EMAIL = '#dupEmail.Email#'
           ORDER BY START_DATE DESC NULLS LAST, ROWNUM DESC
           LIMIT 1 
      </cfquery>
    
      <cfquery name="DeleteUsers" datasource="XXX">
             SELECT * FROM TEMP_SOL
             WHERE USERNAME <> '#UserToRetain.USERNAME#' AND EMAIL='#dupEmail.Email#'
      </cfquery>
    
    </cfloop>
    

Note: The code above hasn't been tested for and the queries may not run on mysql.

Community
  • 1
  • 1
Gaurav S
  • 999
  • 8
  • 16
  • I am a novice in MySql :( Otherwise I wouldn't have encountered these issues. Is it possible to see how it is done in MySql? – Jack Nov 04 '14 at 01:32
  • Sorry mate, don't have a mysql instance installed on my machine. Look at the mysql "rank function" link, it may not be as difficult as you think. – Gaurav S Nov 04 '14 at 01:49
  • Can't seen to get the rank to display. It didn't retrieve the right number. I suppose to feed in the result from the first query. – Jack Nov 04 '14 at 01:53
  • Added a non-rank solution which might be helpful. `LIMIT 1` was the only bit I added for mysql but not sure if it runs as expected. – Gaurav S Nov 04 '14 at 02:13
  • ...WHERE RANK <> 1 ORDER BY email, id desc; It caused the error Unknown column 'rank' in 'where clause' Not sure how this will solve the duplicated record problem... – Jack Nov 04 '14 at 02:16
  • Looks like you are having issues in setting the rank using mysql. Try the non-rank solution as explained in the answer Edit. – Gaurav S Nov 04 '14 at 02:20
  • Let me try it and let you know. Thanks. – Jack Nov 04 '14 at 02:34
  • `` gets the user to be retained for a particular emailid. `cfquery name="DeleteUsers">` will return the records to be deleted. – Gaurav S Nov 04 '14 at 02:38
  • Error Executing Database Query. You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'NULLS LAST, ROWNUM DESC LIMIT 1' at line 3 – Jack Nov 04 '14 at 02:58
  • Check the rownum equivalent in mysql. – Gaurav S Nov 04 '14 at 02:59
  • Also, order by nulls last mysql equivalent. http://stackoverflow.com/questions/2051602/mysql-orderby-a-number-nulls-last – Gaurav S Nov 04 '14 at 03:16
  • Remove `NULLS LAST` from the query. NULLS are considered lower than non-null values in mysql which is opposite to what Oracle does. http://troels.arvin.dk/db/rdbms/#select-order_by – Gaurav S Nov 04 '14 at 03:34
  • the second query not working: SELECT USERID FROM mytable WHERE EMAIL = '#dupEmail.Email#' ORDER BY -STARTDATE DESC LIMIT 1 – Jack Nov 04 '14 at 04:10
  • You mean the ID column? It's like record number. I tried using ID DESC but it will delete the record with lower ID which may have a value in activedate – Jack Nov 04 '14 at 04:50
  • In that case, you need to run `SELECT USERID FROM mytable WHERE EMAIL = '#dupEmail.Email#' ORDER BY STARTDATE DESC, ID DESC LIMIT 1` without the `-` sign in front of `STARTDATE`. Looks like mysql doesn't have ROWNUM and as per your sample data ID looks like an identity column in your table. – Gaurav S Nov 04 '14 at 04:52
  • Two problem with the query: 1 - it is only deleting one duplicate. If there are more than one it's not selecting it. 2. When there are 2 records, the lower id one is being deleted even thou it has startdate and the higher one doesn't – Jack Nov 04 '14 at 05:09
  • Which query has this problem? Is `UserstoRetain` giving you the desired results? – Gaurav S Nov 04 '14 at 05:15
  • No, it's not. It's giving the problems as described. – Jack Nov 04 '14 at 15:30
  • It is taking the oldest record of each email regardless if I sort by ActiveDate ASC or DESC, with or without the ID sort. I think the Limit 1 just take the oldest record, it seems. – Jack Nov 04 '14 at 16:48
  • Looks like it is working. I will edit yours to reflect the working version. – Jack Nov 04 '14 at 18:31
  • I didn't do the negative vote. I don't know what happened to the edit I made on your code. It said it need peer approval. – Jack Nov 05 '14 at 01:46
  • That's alright. So what made it work? You can include it in comments and I'll edit the answer accordingly. – Gaurav S Nov 05 '14 at 01:54
  • I have selected your answer and vote up. First query SELECT *, second query SELECT * and ORDER BY ACTIVEDATE DESC, ID DESC LIMIT 1 and last query WHERE USERID <> '#UserToRetain.USERID#' – Jack Nov 05 '14 at 01:57
  • Jack - You should not need a loop at all. The query posted in [Dan's edit](http://stackoverflow.com/a/26715407/104223) worked fine with your sample data, in MySQL. If you were on SQL Server/Oracle I would go with GauravS's original suggestion of using something like rank() or row_number(). – Leigh Nov 06 '14 at 17:04