So I need to clean up some DB entries that have mistakenly been created with trailing and/or leading whitespaces.
I wrote a small query that gives us all the duplicates, their IDs and what projects they belong to.
(cf. https://dbfiddle.uk/?rdbms=postgres_9.6&fiddle=4cf4b42210f48d571acd416123fb9cfb )
I now want to filter by how many distinct projects the duplicates belong to as we have to treat them differently if they're in different projects.
Basically, I'm looking for <unique elements of duplicates_projects>
in
WITH all_entries AS (
SELECT
e.id AS id,
trim(e.name) AS name,
e.project_id AS project_id
FROM
entries e
),
duplicates_count AS (
SELECT
name,
count(1) AS matches
FROM all_entries
GROUP BY name
HAVING count(1) > 1
),
all_duplicates AS (
SELECT
name,
matches,
ARRAY (SELECT e.id FROM all_entries e WHERE e.name = dc.name) AS duplicates_ids,
ARRAY (SELECT e.project_id FROM all_entries e WHERE e.name = dc.name) AS duplicates_projects
FROM duplicates_count dc
)
SELECT * FROM all_duplicates WHERE array_length(<unique elements of duplicates_projects>, 1) > 1;
;
I did find eliminate duplicate array values in postgres, but
aggregate functions are not allowed in WHERE
and
SELECT * FROM all_duplicates d
WHERE
array_length(
ARRAY(SELECT DISTINCT UNNEST(a.duplicates_projects) FROM all_duplicates a WHERE a.name = d.name), 1
) > 1
;
works but seems overly complicated.
Is there a simpler way to do this?