Find Duplicates in Postgres Array, and Save as Different Field

Question

My Postgres table schema has two fields:

items, which contains an array of (integer) item IDs . If the same ID is in the array twice, it is considered a primary item.

primary_items, another array of (integer) item IDs that I just added to the schema, so it is currently empty across all rows.

What I need to do is: for each row, check if there are duplicate IDs in items, and if so, put one of each duplicate ID in the primary_items field.

Any idea of how to approach this with a query? Ideally, I wouldn't have to write a helper program (Nodejs) to do this, but I can if needed.

Example:

Current:

documents_table
items          primary_items
------------   -----------
{1, 2, 2, 4}    {}
{1, 2, 3}       {}
{3, 3}          {}
{5, 4, 5, 4}    {}

Desired:

documents_table
items          primary_items
------------   -----------
{1, 2, 2, 4}    {2}
{1, 2, 3}       {}
{3, 3}          {3}
{5, 4, 5, 4}    {5,4}

Duplicate of existing questions; please see this [link](https://stackoverflow.com/questions/29897501/find-duplicated-values-on-array-column): — Ms workaholic, Nov 16 '17 at 23:45
Possible duplicate of [Find duplicated values on array column](https://stackoverflow.com/questions/29897501/find-duplicated-values-on-array-column) — Michel Milezzi, Nov 16 '17 at 23:54
I looked through the other question, and it does not answer my problem. I don't want to create a new set of rows, I want to update the existing one with data from that own row. I'll update the question with an example. — K. Barresi, Nov 17 '17 at 14:22
The other question *is* what you want to do. You convert your array to a row set with `unnest`, filter that for duplicates, then convert back to an array with `array_agg` — Scoots, Nov 17 '17 at 14:30
Ok I think I understand it a bit better now. I will self-answer with a full example query once I get it working for others to see. — K. Barresi, Nov 17 '17 at 14:40

klin · Answer 1 · 2018-08-20T21:41:51.743

You can find duplicate elements with this simple query:

select array_agg(item)
from (
    select item
    from unnest('{5, 4, 5, 4}'::int[]) as item
    group by item
    having count(*) > 1
) s

 array_agg 
-----------
 {4,5}
(1 row)

Use the query in a function:

create or replace function find_primary_items(arr anyarray)
returns anyarray language sql immutable
as $$
    select array_agg(item)
    from (
        select item
        from unnest(arr) as item
        group by item
        having count(*) > 1
    ) s
$$;

Now the update is really easy:

update documents_table
set primary_items = coalesce(find_primary_items(items), '{}')
returning *;

   items   | primary_items 
-----------+---------------
 {1,2,2,4} | {2}
 {1,2,3}   | {}
 {3,3}     | {3}
 {5,4,5,4} | {4,5}
(4 rows)

score 1 · Answer 2 · answered Aug 20 '18 at 20:55

SQL Fiddle

You can do this by counting the doubled elements in your array. Unnest the array and count the doubled values with the row_number window function:

https://www.postgresql.org/docs/current/static/functions-window.html

UPDATE documents_table                           -- E
SET primary_items = s.primaries
FROM (
    SELECT 
        id,
        array_agg(array_element) as primaries    -- D.2
    FROM (
        SELECT
            id,
            array_element,
            row_number() OVER (PARTITION BY id, array_element) as same_element_count  -- C
        FROM (
            SELECT 
                items as id,                     -- A
                unnest(items) as array_element   -- B
            FROM 
                documents_table
            ORDER BY 
                id, array_element
        ) s
    ) s
    WHERE same_element_count = 2                -- D.1
    GROUP BY id) s
WHERE items = s.id;

A: needed an ID. Took the array as ID but it would be better to add an ID column for that

B: unnest expands the array to one row per element. This is necessary because the rows can be sorted and used for further window function calculations.

C: row_number window function: The partitions divide the rows. After ordering, each value within an ID is in the same partition. If there are two or more times the same value in an ID these values are in the same partition. row_number counts the elements in one partition. So doubled values get the row number '2'

D.1: Filtering the doubled values

D.2: aggregate all doubled values per ID into an array. That gives your final primary_items columns

E: Update these calculates primary_items array in the table.

Find Duplicates in Postgres Array, and Save as Different Field

2 Answers2