0

So I need to clean up some DB entries that have mistakenly been created with trailing and/or leading whitespaces.

I wrote a small query that gives us all the duplicates, their IDs and what projects they belong to.

(cf. https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=4cf4b42210f48d571acd416123fb9cfb )

I now want to filter by how many distinct projects the duplicates belong to as we have to treat them differently if they're in different projects.

Basically, I'm looking for <unique elements of duplicates_projects> in

WITH all_entries AS (
    SELECT
        e.id AS id,
        trim(e.name) AS name,
        e.project_id AS project_id
    FROM
        entries e
),
duplicates_count AS (
    SELECT
       name,
       count(1) AS matches
    FROM all_entries
    GROUP BY name
    HAVING count(1) > 1
),
all_duplicates AS (
    SELECT
        name,
        matches,
        ARRAY (SELECT e.id FROM all_entries e WHERE e.name = dc.name) AS duplicates_ids,
        ARRAY (SELECT e.project_id FROM all_entries e WHERE e.name = dc.name) AS duplicates_projects
    FROM duplicates_count dc
)
SELECT * FROM all_duplicates WHERE array_length(<unique elements of duplicates_projects>, 1) > 1;
;

I did find eliminate duplicate array values in postgres, but

aggregate functions are not allowed in WHERE

and

SELECT * FROM all_duplicates d
WHERE
    array_length(
        ARRAY(SELECT DISTINCT UNNEST(a.duplicates_projects) FROM all_duplicates a WHERE a.name = d.name), 1
    ) > 1
;

works but seems overly complicated.

Is there a simpler way to do this?

User1291
  • 7,664
  • 8
  • 51
  • 108
  • 1
    Add [DISTINCT](https://www.postgresql.org/docs/9.6/sql-select.html) to the select of your projects ([dbfiddle](https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=63d15f2b0675eb30aff757910dd01db8)) – Luuk Nov 02 '20 at 11:39
  • Unrelated, but: `count(1)` is actually slower than `count(*)` –  Nov 02 '20 at 11:44
  • @Luuk So obvious in hindsight. XD Thanks a lot. Want to make an answer out of it? – User1291 Nov 02 '20 at 12:29
  • I leave that option open to anyone who collects reputation points..... – Luuk Nov 02 '20 at 13:31
  • @a_horse_with_no_name: can you measure the difference? – Luuk Nov 02 '20 at 13:45
  • 1
    @luuk Lukas Eder did: https://blog.jooq.org/2019/09/19/whats-faster-count-or-count1/ "*A significant, consistent difference of almost 10%*" –  Nov 02 '20 at 13:55
  • ah, postgresql.... – Luuk Nov 02 '20 at 14:00

0 Answers0