How to find similar groups based on different IDs

Question

I have a table that looks pretty much like this:

Table

I'm trying to find which cases are duplicates - meaning, which cases have the same Group, parameter1, parameter2, parameter3 I tried using a Cartesian product but then I didn't know if those cases have the same # of rows (since in the where clause I take only the equal ones, like here:

where a.group=b.group
and a.parameter1=b.parameter1
and a.parameter2=b.parameter2
and a.parameter3=b.parameter3
and a.case!=b.case

) and if all of them are the same.

Did anyone work on something similar and might have a solution for this?

Just to make it clearer, in what I sent (the table) - I would like to see that cases 11 and 12 are duplicated and cases 323 and 43 — Tom Levy, Feb 28 '19 at 09:53
An easiest way to do it .... https://stackoverflow.com/questions/2421388/using-group-by-on-multiple-columns Use the GROUP in sql — Thomas_krk, Feb 28 '19 at 10:17

score 0 · Answer 1 · answered Feb 28 '19 at 09:52

you can try by using exists

   select t1.* from table_name t1
    where exists  ( select 1 from table_name t2 where t1.parameter1=t2.parameter1
                                              and t1.parameter2=t2.parameter2
                                              and t1.parameter3=t2.parameter3
                                        having count(*)>1
                  )

score 0 · Answer 2 · answered Feb 28 '19 at 10:09

I am not sure if this is the easiest way to do it but it's working for me, You run a query to find the data you want to compare with and a second query inside the first one for the result...

<?php 

//Retrieves data from MySQL for First Time
$data_p = mysql_query("SELECT * FROM TABLE") or die(mysql_error()); 

 //Puts it into an array 
 while($info = mysql_fetch_array( $data_p )) 
 {
$parameter1_bs=$info['parameter1'];
$parameter2_bs=$info['parameter2'];
$parameter3_bs=$info['parameter3'];


//Retrieves data from MySQL to compare with the previous result 
$data_po = mysql_query("SELECT * FROM TABLE WHERE $parameter1_bs=parameter1 AND $parameter2_bs=parameter2 AND $parameter3_bs=parameter3") or die(mysql_error()); 
$row = mysql_num_rows($data_po); // $row will count how many results

 //Puts it into an array 
 while($info = mysql_fetch_array( $data_pp )) 
 {
$case=$info['case'];
echo $case;
echo "<br>";
}}

?>

score 0 · Answer 3 · answered Feb 28 '19 at 12:59

Use window functions!

select t.*
from (select t.*,
             count(*) over (partition by Group, parameter1, parameter2, parameter3) as cnt
      from t
     ) t
where cnt > 1
order by Group, parameter1, parameter2, parameter3;

You can get the key values that are duplicates with a simple aggregation:

select Group, parameter1, parameter2, parameter3, count(*)
from t
group by Group, parameter1, parameter2, parameter3
having count(*) > 1;

score 0 · Answer 4 · answered Mar 01 '19 at 12:59

This might clarify or confuse a bit more what my predecessors wrote.

It's SQL. And Vertica is one of the databases that support the ANSI 2003 standard, the one with "window based" functions, also called OLAP functions. That's the ones that add an OVER() clause after a function call. Just as Gordon said above - this query actually works - and I just insert your very input data into the first Common Table Expression - the first entry list of the initial WITH clause.

Here goes:

WITH -- your input ....
input(grp,caseno,p1,p2,p3) AS (
          SELECT 'A', 11,'x','x','x'
UNION ALL SELECT 'A', 12,'x','x','x'
UNION ALL SELECT 'C', 21,'x','x','y'
UNION ALL SELECT 'E',323,'y','y','y'
UNION ALL SELECT 'E', 43,'y','y','y'
)
,
find_dups AS (
  SELECT
    *
  , COUNT(*) OVER(PARTITION BY grp,p1,p2,p3) AS occ_count
  FROM input
)
SELECT * FROM find_dups;
-- out  grp | caseno | p1 | p2 | p3 | occ_count 
-- out -----+--------+----+----+----+-----------
-- out  A   |     11 | x  | x  | x  |         2
-- out  A   |     12 | x  | x  | x  |         2
-- out  C   |     21 | x  | x  | y  |         1
-- out  E   |    323 | y  | y  | y  |         2
-- out  E   |     43 | y  | y  | y  |         2
-- out (5 rows)
-- out 
-- out Time: First fetch (5 rows): 17.462 ms. All rows formatted: 17.514 ms

And: from here on, you can filter by 'occ_count', identify the biggest number of duplicates ordering by occ_count, or whatever you finally need.

How to find similar groups based on different IDs

4 Answers4