
I'm working on Redshift SQL with a table that has one string column containing a comma-separated list like so (the real data has 15 values, not just 5):

grid_state
------------------
"T0,NA,NA,NA,NA,"
"T3,T3,T3,NA,NA,"
"U0,T4,T3,T4,NA,"
"T1,NA,NA,T1,NA,"
"T3,T3,NA,T2,NA,"
"T2,T2,T0,T1,NA,"
"T3,T3,T3,T2,T1,"
"U0,T4,T3,T4,U0,"
"T2,T2,T2,T1,T1,"
"T3,T3,T3,T0,NA,"

What I want to do specifically is, ignoring NA values, count how many pairs of duplicate values I have in each string (a value that appears three times still only counts as one pair, i.e. floor(count / 2)), so the result here should be:

duplicate_pairs
----------------
0
1
1
1
1
1
1
2
2
1

If I were using a regular programming language, I would write a nice little function to iterate over the values, or something fancier. But I'm very new to SQL, and although I understand there is a way to use functions, I'm not super confident with them, and I haven't found an example online close enough to what I need to adapt.

What I'm currently doing is essentially what I would like to do in a function, but with very messy repeated case when statements and new column creations, which is neither optimised nor scalable:

with messy_actions as (
    select
        replace(grid_state, 'NA,', '') as grid_monsters_0,
        case when len(grid_monsters_0) > 0
            then regexp_count(grid_monsters_0, left(grid_monsters_0, 3)) / 2 end as duplicate_pairs_0,
        case when len(grid_monsters_0) > 0
            then replace(grid_monsters_0, left(grid_monsters_0, 3), '') end as grid_monsters_1,
        case when len(grid_monsters_1) > 0
            then regexp_count(grid_monsters_1, left(grid_monsters_1, 3)) / 2 end as duplicate_pairs_1,
        case when len(grid_monsters_1) > 0
            then replace(grid_monsters_1, left(grid_monsters_1, 3), '') end as grid_monsters_2,
        case when len(grid_monsters_2) > 0
            then regexp_count(grid_monsters_2, left(grid_monsters_2, 3)) / 2 end as duplicate_pairs_2,
        case when len(grid_monsters_2) > 0
            then replace(grid_monsters_2, left(grid_monsters_2, 3), '') end as grid_monsters_3,
        case when len(grid_monsters_3) > 0
            then regexp_count(grid_monsters_3, left(grid_monsters_3, 3)) / 2 end as duplicate_pairs_3,
        case when len(grid_monsters_3) > 0
            then replace(grid_monsters_3, left(grid_monsters_3, 3), '') end as grid_monsters_4,
        case when len(grid_monsters_4) > 0
            then regexp_count(grid_monsters_4, left(grid_monsters_4, 3)) / 2 end as duplicate_pairs_4
    from actions
)
select
    coalesce(duplicate_pairs_0, 0) + coalesce(duplicate_pairs_1, 0) +
    coalesce(duplicate_pairs_2, 0) + coalesce(duplicate_pairs_3, 0) +
    coalesce(duplicate_pairs_4, 0) as duplicate_pairs
from messy_actions;

This is somewhat preferable to the (even more) brute-force solution of checking regexp_count for each possible value, as there are currently 45 possible values and the list might expand in the future.
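
For illustration, the brute-force version would look roughly like this, with one regexp_count term per possible value (only the six values that appear in the sample are shown; the real query would need all 45):

-- brute force: count each known value directly (sketch; only sample values shown)
select
    regexp_count(grid_state, 'T0,') / 2 +
    regexp_count(grid_state, 'T1,') / 2 +
    regexp_count(grid_state, 'T2,') / 2 +
    regexp_count(grid_state, 'T3,') / 2 +
    regexp_count(grid_state, 'T4,') / 2 +
    regexp_count(grid_state, 'U0,') / 2 as duplicate_pairs
from actions;
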

From what I see in this answer, the solution might be to define a function and use it in the select statement, but I've also read that this is potentially not the best approach either, and as I said, I'm not super comfortable with writing functions in SQL just yet (I'm more used to Python and R).
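
From the examples I have found, I think the function route would be a scalar Python UDF along these lines (just a sketch I haven't tested, and f_duplicate_pairs is a name I made up):

-- sketch of a Redshift scalar Python UDF (untested; f_duplicate_pairs is a made-up name)
create or replace function f_duplicate_pairs(grid_state varchar)
returns int
stable
as $$
    # NULL in, NULL out
    if grid_state is None:
        return None
    from collections import Counter
    # split on commas, dropping empty entries and the NA placeholder
    vals = [v for v in grid_state.split(',') if v and v != 'NA']
    # every two occurrences of the same value count as one pair
    return sum(c // 2 for c in Counter(vals).values())
$$ language plpythonu;

-- which would then be used like:
-- select f_duplicate_pairs(grid_state) as duplicate_pairs from actions;
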

Edits:

  1. The first sentence should have more precisely stated that the table I'm working on has one such string column alongside many other columns.
  2. The table I'm working with has constraints imposed on me by the back-end team, namely I have to work with a generic set of columns common to all the other projects we have, smartly named generic_[int|string|bool|float]_[0-5], so saving the data normalised is not an option.
  • Save the data normalised; that saves you the trouble of splitting it, which costs lots of resources. See https://stackoverflow.com/questions/3653462/is-storing-a-delimited-list-in-a-database-column-really-that-bad – nbk Aug 16 '23 at 15:25
  • Thanks for the comment @nbk, reading through that question and the answers was very interesting, however I can't save the data normalised in my case, I've now added some context to the question :) – Arthur Spoon Aug 17 '23 at 06:36
  • Then split the strings and you have a normalised version. You need a unique column so that you can identify groups, which you didn't provide, so the question can't be answered well – nbk Aug 17 '23 at 07:15

1 Answer


Here is a step-by-step walkthrough of how you can do this by splitting each value on the comma and then counting occurrences of each value:

create temp table sample_table (val varchar(100));
insert into sample_table
    values
('T0,NA,NA,NA,NA,'),
('T3,T3,T3,NA,NA,'),
('U0,T4,T3,T4,NA,'),
('T1,NA,NA,T1,NA,'),
('T3,T3,NA,T2,NA,'),
('T2,T2,T0,T1,NA,'),
('T3,T3,T3,T2,T1,'),
('U0,T4,T3,T4,U0,'),
('T2,T2,T2,T1,T1,'),
('T3,T3,T3,T0,NA,');

--take out NA vals and give each row an id
with recursive cte AS (
    select row_number() OVER (order by 1) AS id, replace(val, 'NA,', '') AS newval
    from sample_table
),
-- setup a list of numbers 1-15 that we'll use for splitting
numbers(counter_) AS (
    select 1 AS counter_
    union all
    select counter_ + 1
    from numbers
    where counter_ < 15
),
-- cross join numbers to all vals to use for splitting
split_setup AS (
    select *
    from cte
    cross join numbers
),
-- split the vals from comma lists to rows
split_vals AS (
    select id,
           split_part(newval,',', counter_) AS splitvals
    from split_setup
    where split_part(newval, ',', counter_) != ''
),
-- count occurrences of each val
final AS (
    select id,
           splitvals,
           count(*) AS valcounter
    from split_vals
    group by 1, 2
)
-- check if there are dupes
select id,
       sum(case when valcounter > 1 then 1 else 0 end) AS duplicate_pairs
from final
group by 1
order by id;
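
To run this against the question's actual table, only the source of the first CTE needs to change. Here is a sketch of the whole thing adapted (it assumes the table and column are named actions and grid_state as in the question; row_number just manufactures a row id, so if your real table has a unique key, use that instead):

-- same approach, reading from the question's table (sketch: assumes actions.grid_state)
with recursive cte AS (
    select row_number() OVER (order by 1) AS id,
           replace(grid_state, 'NA,', '') AS newval
    from actions
),
-- numbers 1-15, one per possible position in the list
numbers(counter_) AS (
    select 1 AS counter_
    union all
    select counter_ + 1
    from numbers
    where counter_ < 15
),
-- split the comma lists into one row per value
split_vals AS (
    select id,
           split_part(newval, ',', counter_) AS splitvals
    from cte
    cross join numbers
    where split_part(newval, ',', counter_) != ''
),
-- count occurrences of each value per row
counts AS (
    select id, splitvals, count(*) AS valcounter
    from split_vals
    group by 1, 2
)
select id,
       sum(case when valcounter > 1 then 1 else 0 end) AS duplicate_pairs
from counts
group by 1
order by id;

If a value can show up four or more times and that should count as two pairs, using sum(valcounter / 2) (integer division) in the final select would match the regexp_count(...) / 2 logic from the question.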