How to improve MySql query that trying to find duplicate entries in large database?

Question

I have to make my query on large database (Snort alerts) to find duplicate entries. However, I came up with bellow query, but it takes so many time to be executed!

SELECT sid, cid, timestamp, sig_name, inet_ntoa(ip_src), layer4_sport,
       inet_ntoa(ip_dst), layer4_dport
  FROM DB
 WHERE (ip_dst IN
        (SELECT ip_dst FROM DB GROUP BY ip_dst HAVING count(*) > 1)
   AND timestamp IN
        (SELECT timestamp FROM DB GROUP BY timestamp HAVING count(*) > 1)
   AND layer4_dport IN
        (SELECT layer4_dport FROM DB GROUP BY layer4_dport HAVING count(*)>1 ))

The above query trying to find alerts ip_dst that has same timestamp and layer4_dport if the they came more than one time. I hope its clear!

Any tips or tricks to make it efficient?

score 1 · Answer 1 · edited May 23 '17 at 10:24

1

The link below could help you.

Find duplicate records in MySQL

I hope that this post can help you optimize your query.

edited May 23 '17 at 10:24

Community

1
1

answered Mar 04 '12 at 09:14

Aristotelis Kostopoulos

835
4
15

Thanks a lot. My query is similar less or more to the ideas that have been posted on the post. The query is working fine, but I am looking for a better way to make it efficient. – Aymen Mar 04 '12 at 09:24
@user1247874: Not actually. As you can see from the post above given by Ben, your query is far from the ideas presented in the link that I gave you. Check the answer given above and then the one given in the link by Powerlord. You should be getting closer to what you need by following these. – Aristotelis Kostopoulos Mar 04 '12 at 10:47
Thanks. your interesting is highly appreciated. Aymen – Aymen Mar 04 '12 at 15:36

Ben · Accepted Answer · 2012-03-04T19:25:52.903

I've formatted your query... if we break it down you seem to be applying a couple of functions inet_ntoa. If you don't have a pressing need for then get rid of them ( especially if they look at a table ).

Secondly, if we look at your query you are doing a full scan of DB 3 times for your various counts, and then at the very minimum a range scan in your top level select.

SELECT sid, cid, timestamp, sig_name, inet_ntoa(ip_src), layer4_sport, inet_ntoa(ip_dst), layer4_dport 
  FROM DB 
 WHERE ( ip_dst IN ( SELECT ip_dst 
                       FROM DB 
                      GROUP BY ip_dst 
                     HAVING count(*) > 1 ) 
   AND timestamp IN ( SELECT timestamp 
                        FROM DB 
                       GROUP BY timestamp 
                      HAVING count(*) > 1 ) 
   AND layer4_dport IN ( SELECT layer4_dport 
                           FROM DB 
                          GROUP BY layer4_dport 
                         HAVING count(*) > 1 ) 
        )

By not linking your subquery back to the main table, you've assumed that ip_dst, timestamp and layer4_dport are each unique across the whole table and then are trying to find where the unlikely occurrence of 3 independently unique values happened to have duplicates in the same row.

I suspect what you want to do is something like the following:

SELECT a.sid, a.cid, a.ip_dst, a.timestamp, a.sig_name, a.layer4_sport, a.layer4_dport 
  FROM DB a 
  JOIN ( SELECT timestamp, layer4_dport 
           FROM DB 
          GROUP BY timestamp, layer4_dport
         HAVING count(*) > 1 ) b
    ON a.timestamp = b.timestamp
   AND a.layer4_dport = b.layer4_dport

This finds you all the rows where there are more than 1 identical timestamp and layer4_dport combinations as per your question.

If you want to find all the duplicates at the level of ip_dst then you need to add this to your sub-query.

Thank you Ben,I have added ip_dst and also I returned inet_ntoa to the query cause I want to print the result of it.But,always I got error Column'timestamp' in field list is ambiguous and the query is like the following: SELECT sid, cid, timestamp, sig_name, inet_ntoa(ip_src), layer4_sport, inet_ntoa(ip_dst), layer4_dport FROM DB a JOIN ( SELECT timestamp, layer4_dport, ip_dst FROM DB GROUP BY timestamp, layer4_dport, ip_dst HAVING count(*) > 1 ) b ON a.timestamp = b.timestamp AND a.layer4_dport = b.layer4_dport AND a.ip_dst = b.ip_dst -Aymen — Aymen, Mar 04 '12 at 15:33
Working fine now. I just added a.timestamp on the select sentence as following: SELECT sid, cid, a.timestamp, sig_name, inet_ntoa(ip_src), layer4_sport, inet_ntoa(a.ip_dst), a.layer4_dport FROM DB a .... Thanks a lot Ben, -Aymen — Aymen, Mar 04 '12 at 16:05
Ah, yes, sorry @user1247874, I didn't explicitly reference which table the select variables came from. I've updated the answer. — Ben, Mar 04 '12 at 19:26

How to improve MySql query that trying to find duplicate entries in large database?

2 Answers2