2

referring to this question:

Finding duplicate values in multiple colums in a SQL table and count

I have the following table structure:

id name1 name2 name3  ...
 1 Hans  Peter Frank
 2 Hans  Frank Peter
 3 Hans  Peter Frank
 4 Paul  Peter Hans
 .
 .
 .

I use the following command to display the duplicates and the counts:

SELECT COUNT(name1), name1, name2, name3 
FROM table 
GROUP BY name1, name2, name3 
HAVING (COUNT(name1) > 1) AND (COUNT(name2) > 1) AND (COUNT(name3) > 1)

This command gives me a count of 2. I would like to know how the second line could also be counted as a dublicate.

Unfortunately, the solution to the original question (Finding duplicate values in multiple colums in a SQL table and count) does not work for char

forpas
  • 160,666
  • 10
  • 38
  • 76
MikaBA
  • 57
  • 4

3 Answers3

1

First normalize the table with UNION ALL in a CTE to get each of the 3 names in a separate row.
Then with ROW_NUMBER() window function you can rank alphabetically the 3 names so that you can group by them:

WITH cte(id, name) AS (
  SELECT id, name1 FROM tablename
  UNION ALL
  SELECT id, name2 FROM tablename
  UNION ALL
  SELECT id, name3 FROM tablename
)
SELECT COUNT(*) count, name1, name2, name3
FROM (
  SELECT id,
         MAX(CASE WHEN rn = 1 THEN name END) name1,
         MAX(CASE WHEN rn = 2 THEN name END) name2,
         MAX(CASE WHEN rn = 3 THEN name END) name3
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY name) rn
    FROM cte
  )
  GROUP BY id
)
GROUP BY name1, name2, name3
HAVING COUNT(*) > 1;

Another way to do it, that uses similar logic to your previous question with numeric values, with string function REPLACE() instead of window functions, but works only if the 3 names in each row are different:

SELECT COUNT(*) count,
       MIN(name1, name2, name3) name_1,
       REPLACE(
         REPLACE(
           REPLACE(name1 || ',' || name2 || ',' || name3, MIN(name1, name2, name3), ''), 
           MAX(name1, name2, name3), ''), ',', ''
       ) name_2,
       MAX(name1, name2, name3) name_3
FROM tablename 
GROUP BY name_1, name_2, name_3 
HAVING COUNT(*) > 1;

See the demo.

forpas
  • 160,666
  • 10
  • 38
  • 76
1

Not particularly pretty but a different approach to pivot the columns and then aggregate them as a string and counting duplicates. Slightly fiddly as unfortunately in SQL Lite the group_concat function can't specify any ordering within the group, forcing another level of nesting and a row_number. I guess that's part of being SQL lite!

You could remove the Seq and filtering criteria if you wanted to show all the duplicate variations.

with cte as (
    select Duplicates, name1, name2, name3, 
      Row_Number() over(partition by Duplicates order by name1,name2,name3) Seq
    from (
      select count(*) over(partition by allnames) Duplicates, name1, name2, name3
      from t
      left join (
      select Id, group_concat(Dname) allNames
      from (
          select Id, Dname, row_number() over (partition by Id order by Dname) seq
              from (
                  select id, name1 Dname from t union all
                  select id, name2 from t union all
                  select id, name3 from t
              )x
       )x
     group by Id
     order by seq
     )d on d.id=t.id
   )d
)
select Duplicates, name1, name2, name3
from cte
where duplicates>1 and seq=1

Demo Fiddle

Stu
  • 30,392
  • 6
  • 14
  • 33
0

You just need to sort rows by all columns, group them, count the duplicates, and find the record with the most duplicates. It is rather a hassle to get this written in SQL, which uses a nested query containing window function. The SQL statement is long and difficult to read. An alternative is to export data out of database and handle it in Python or SPL. SPL, the open-source Java package, is easier to be integrated into a Java program and generate much simpler code. It finishes the task with only three lines of code:

A
1 =sqlite.query("select name1,name2,name3 from names")
2 =A1.group([#1,#2,#3].sort();~.len()).maxp(#2)
3 =create(count,name1,name2,name3).record([A2.#2,A2.#1(1),A2.#1(2),A2.#1(3)])