1

So I've got a large amount of SQL data that looks basically like this:

user | src | dst
1    | 1   | 1
1    | 1   | 1
1    | 1   | 2
1    | 1   | 2
2    | 1   | 1
2    | 1   | 3

I want to filter out pairs of (src,dst) that are unique to one user (even if that user has duplicates), leaving behind only those pairs belonging to more than one user:

user | src | dst
1    | 1   | 1
1    | 1   | 1
2    | 1   | 1

In other words, pair (1,2) is unique to user 1 and pair (1,3) to user 2, so they're dropped, leaving behind only all instances of pair (1,1).

Edit: To clarify, I am not interested in the pairs that get filtered out, I need all the rows where the pairs are not unique.

Any ideas? The answers to the question below can find the non-unique pairs, but my SQL-fu doesn't suffice to handle the complication of requiring that they belong to multiple users as well.

How to select non "unique" rows

Community
  • 1
  • 1
lambshaanxy
  • 22,552
  • 10
  • 68
  • 92

3 Answers3

1

Join using EXISTS syntax

SELECT t1.*
FROM   table t1
WHERE EXISTS (SELECT 1 
              FROM   table t2 
              WHERE  t1.src   = t2.src 
               AND   t1.dst   = t2.dst
               AND   t1.user != t2.user)

Another option that MIGHT work (didn't test). If it works it would be more efficient:

SELECT user, src, dst
FROM   table
GROUP BY src, dst
HAVING COUNT(DISTINCT user) > 1
DVK
  • 126,886
  • 32
  • 213
  • 327
  • Your first solution works, but takes 7 seconds on a test table with ~2000 rows; the real table has over half a million entries. The second one is much faster, but mushes together duplicate rows. – lambshaanxy Feb 09 '11 at 07:52
  • On the upside, at least I managed to figure out the size of the answer set by adding a `count(*)` column to the second one and summing them up. – lambshaanxy Feb 09 '11 at 08:11
  • @jpatokal - what are the indexes on the table? – DVK Feb 09 '11 at 08:26
  • @jpatokal - also, try to marry #1 and #2 (e.g. first extract the unique rows into a temp table using #2, and then join your table to that temp table on user, src and dst. It MIGHT be faster than #1, again depending on indexes. – DVK Feb 09 '11 at 08:27
  • user, src and dst are all indexed, the problem is that the WHERE EXISTS is exceedingly slow when the dataset of the subquery is large. But tiko seems to have hit on the solution. – lambshaanxy Feb 09 '11 at 11:13
  • @jpatokal - I think tiko's solution is exactly what I proposed 1 comment above. – DVK Feb 09 '11 at 11:50
1

My solution would be something like this (tested):

SELECT user, src, dst, COUNT(user) as num_of_users
FROM test
GROUP BY src, dst
HAVING num_of_users = 1

Edit: The following code produces the results that you provided in your example.

SELECT test.user, test.src, test.dst
FROM test
INNER JOIN
(
    SELECT user, src, dst, COUNT(DISTINCT user) as num_of_users
    FROM test
    GROUP BY src, dst
    HAVING num_of_users > 1
) as inner_sql
USING(src, dst)

Is this solution the one you are looking for? How is the performance?

tiko
  • 61
  • 3
  • This returns the rows that are unique, but I need the rows that are *not* unique, and `HAVING num_of_users > 1` mushes together duplicated rows. – lambshaanxy Feb 09 '11 at 07:49
  • Ding ding ding, we have a winner! The second version is tolerably fast and seems to produce the correct rows. – lambshaanxy Feb 09 '11 at 11:11
0

May be with some group by?

something like

select user,src,dst,count(*) as c from table group by user,src,dst 
having c = 1
Cesar
  • 1,610
  • 2
  • 16
  • 42