I'm working on Redshift SQL with a table that has one string column containing a comma-separated list like so (the real data has 15 values, not just 5):
grid_state
------------------
"T0,NA,NA,NA,NA,"
"T3,T3,T3,NA,NA,"
"U0,T4,T3,T4,NA,"
"T1,NA,NA,T1,NA,"
"T3,T3,NA,T2,NA,"
"T2,T2,T0,T1,NA,"
"T3,T3,T3,T2,T1,"
"U0,T4,T3,T4,U0,"
"T2,T2,T2,T1,T1,"
"T3,T3,T3,T0,NA,"
What I want to do specifically is, ignoring NA
values, count how many pairs of duplicate values I have in each string, so the result here should be:
duplicate_pairs
----------------
0
1
1
1
1
1
1
2
2
1
If I was using a regular programming language I would write a nice little function to go iterate over the values or something fancier, but I'm very new to SQL and although I understand there is a way to use functions, I'm not super confident with them, and haven't found an example online that was close enough to what I needed for me to use it.
What I'm currently doing is essentially what I would like to do in a function, but with very messy repeated case when
statements and new column creations, which is not exactly optimised nor scalable:
with messy_actions as (
select
replace(grid_state, 'NA,', '') as grid_monsters_0,
case when len(grid_monsters_0) > 0
then regexp_count(grid_monsters_0, left(grid_monsters_0, 3)) / 2 end as duplicate_pairs_0,
case when len(grid_monsters_0) > 0
then replace(grid_monsters_0, left(grid_monsters_0, 3), '') end as grid_monsters_1,
case when len(grid_monsters_1) > 0
then regexp_count(grid_monsters_1, left(grid_monsters_1, 3)) / 2 end as duplicate_pairs_1,
case when len(grid_monsters_1) > 0
then replace(grid_monsters_1, left(grid_monsters_1, 3), '') end as grid_monsters_2,
case when len(grid_monsters_2) > 0
then regexp_count(grid_monsters_2, left(grid_monsters_2, 3)) / 2 end as duplicate_pairs_2,
case when len(grid_monsters_2) > 0
then replace(grid_monsters_2, left(grid_monsters_2, 3), '') end as grid_monsters_3,
case when len(grid_monsters_3) > 0
then regexp_count(grid_monsters_3, left(grid_monsters_3, 3)) / 2 end as duplicate_pairs_3,
case when len(grid_monsters_3) > 0
then replace(grid_monsters_3, left(grid_monsters_3, 3), '') end as grid_monsters_4,
case when len(grid_monsters_4) > 0
then regexp_count(grid_monsters_4, left(grid_monsters_4, 3)) / 2 end as duplicate_pairs_4,
from actions
)
select
duplicate_pairs_0 + duplicate_pairs_1 + duplicate_pairs_2 +
duplicate_pairs_3 + duplicate_pairs_4 as duplicate_pairs
This is somewhat preferable to the (even more) brute force solution of checking regexp_count
for each possible value as there are currently 45 possible values and the list might expand in the future.
From what I see in this answer, the solution might be to define a function and use it in the select
statement, but I've also read that this is potentially not the best either, and as I have said, I'm not super comfortable with writing functions in SQL just yet (more used to Python and R).
Edits:
- The first sentence should have more precisely stated that the table I'm working on has one such string column alongside many other columns.
- The table I'm working with has constraints imposed on me by the back-end team, namely I have to work with a generic set of columns common to all the other projects we have, smartly named
generic_[int|string|bool|float]_[0-5]
, so saving the data normalised is not an option.