
I have a CTE with data like the below. The counts and process_ids columns follow one of two formats.

client_id      day              counts      process_ids
--------------------------------------------------------------------------------------------
abc1          Feb-01-2021        3        C1,C2 | C3,C4,C5 | C6,C7
abc2          Feb-05-2021       2, 3      C10,C11,C12 | C13,C14 # C15,C16 | C17,C18

Now I want to get the output below from the above CTE after splitting it out on counts and process_ids:

client_id      day              counts      process_ids
--------------------------------------------------------
abc1           Feb-01-2021        3           C1
abc1           Feb-01-2021        3           C2
abc1           Feb-01-2021        3           C3
abc1           Feb-01-2021        3           C4
abc1           Feb-01-2021        3           C5
abc1           Feb-01-2021        3           C6
abc1           Feb-01-2021        3           C7
abc2           Feb-05-2021        2           C10
abc2           Feb-05-2021        2           C11
abc2           Feb-05-2021        2           C12
abc2           Feb-05-2021        2           C13
abc2           Feb-05-2021        2           C14
abc2           Feb-05-2021        3           C15
abc2           Feb-05-2021        3           C16
abc2           Feb-05-2021        3           C17
abc2           Feb-05-2021        3           C18

Basically, the idea is to split counts and process_ids based on the two use cases below, whenever a row follows one of those formats.

UseCase 1

If the counts column has only a single digit and the process_ids column uses the | delimiter.


UseCase 2

If the counts column has two digits separated by a , delimiter and the process_ids column uses the # delimiter along with pipes.

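To make the two formats concrete, here is a small sketch (a hypothetical helper, not part of the CTE) that tells the use cases apart by the rules above:

```python
import re

def classify(counts, process_ids):
    """Return 1 or 2 for the use case a row matches, else None (illustrative)."""
    counts = counts.strip()
    # Use case 1: a single digit in counts, pipe-delimited process_ids
    if re.fullmatch(r"\d", counts) and "|" in process_ids:
        return 1
    # Use case 2: two comma-separated digits in counts, '#'-delimited groups
    if re.fullmatch(r"\d\s*,\s*\d", counts) and "#" in process_ids:
        return 2
    return None
```

For example, `classify("3", "C1,C2 | C3,C4,C5 | C6,C7")` gives 1 and `classify("2, 3", "C10,C11,C12 | C13,C14 # C15,C16 | C17,C18")` gives 2.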

I am working with Amazon Redshift and I am confused about how to split them out as needed.

Is this possible to do by any chance?

AndyP
  • Do you have any queries you’ve attempted and their output or error message? – Allen M Feb 27 '21 at 05:17
  • 1
    I am kind confuse here on what methods I am suppose to use to make this work. I tried reading on how to use regex split but couldn't fully understand. – AndyP Feb 27 '21 at 07:09

2 Answers


This might look a bit hairy at first sight but has been built up from solid techniques and gives the desired result...

SQL

WITH seq_0_9 AS (
  SELECT 0 AS d
  UNION ALL SELECT 1 AS d
  UNION ALL SELECT 2 AS d
  UNION ALL SELECT 3 AS d
  UNION ALL SELECT 4 AS d
  UNION ALL SELECT 5 AS d
  UNION ALL SELECT 6 AS d
  UNION ALL SELECT 7 AS d
  UNION ALL SELECT 8 AS d
  UNION ALL SELECT 9 AS d
),
numbers AS (
  SELECT a.d + b.d * 10 + c.d * 100 + 1 AS n
  FROM seq_0_9 a, seq_0_9 b, seq_0_9 c
),
processed AS
  (SELECT client_id,
          day,
          REPLACE(counts, ' ', '') AS counts,
          REPLACE(REPLACE(process_ids, ' ', ''), '|', ',') AS process_ids
   FROM tbl),
split_pids AS
  (SELECT
     client_id, 
     day,
     counts,
     split_part(process_ids, '#', n) AS process_ids,
     n AS n1
   FROM processed
   CROSS JOIN numbers
   WHERE 
     split_part(process_ids, '#', n) IS NOT NULL
     AND split_part(process_ids, '#', n) != ''),
split_counts AS
  (SELECT
     client_id, 
     day,
     split_part(counts, ',', n) AS counts,
     process_ids,
     n1,
     n AS n2
   FROM split_pids
   CROSS JOIN numbers
   WHERE
     split_part(counts, ',', n) IS NOT NULL
     and split_part(counts, ',', n) != ''),
matched_up AS
  (SELECT * FROM split_counts WHERE n1 = n2)
SELECT
  client_id, 
  day,
  counts,
  split_part(process_ids, ',', n) AS process_ids
FROM
  matched_up
CROSS JOIN
  numbers
WHERE
  split_part(process_ids, ',', n) IS NOT NULL
  AND split_part(process_ids, ',', n) != '';

Demo

Online rextester demo (using PostgreSQL but should be compatible with Redshift): https://rextester.com/FNA16497

Brief Explanation

A digits CTE cross-joined with itself is used to generate a numbers table (from 1 to 1000 inclusive). Cross joins against this numbers table, combined with SPLIT_PART, are then used multiple times via Common Table Expressions to do all the splitting in a single SQL statement.
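The same normalize-then-match-then-explode flow can be cross-checked in plain Python (a sketch mirroring the CTEs above, with `split_part` mimicking Redshift's SPLIT_PART semantics):

```python
def split_part(s, delim, n):
    """Mimic Redshift's SPLIT_PART: 1-based index, '' when n is out of range."""
    parts = s.split(delim)
    return parts[n - 1] if 1 <= n <= len(parts) else ""

rows = [
    ("abc1", "Feb-01-2021", "3", "C1,C2 | C3,C4,C5 | C6,C7"),
    ("abc2", "Feb-05-2021", "2, 3", "C10,C11,C12 | C13,C14 # C15,C16 | C17,C18"),
]

result = []
for client_id, day, counts, pids in rows:
    counts = counts.replace(" ", "")                  # processed CTE
    pids = pids.replace(" ", "").replace("|", ",")
    groups = pids.split("#")                          # split_pids CTE
    for n in range(1, len(groups) + 1):
        count = split_part(counts, ",", n)            # split_counts + matched_up (n1 = n2)
        for pid in split_part(pids, "#", n).split(","):   # final explode on ','
            result.append((client_id, day, count, pid))
```

Running this yields the 16 rows of the desired output, with counts 2 paired to the C10–C14 group and 3 to C15–C18.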

Steve Chambers
    Great answer! One issue will be that in Redshift generate_series() is a leader node only function and cannot be joined to other data. This isn't a big deal b/c there are a number of alternative ways to create numeric series on Redshift. https://stackoverflow.com/questions/22643338/sequence-number-generation-function-in-aws-redshift or https://stackoverflow.com/questions/38667215/redshift-how-can-i-generate-a-series-of-numbers-without-creating-a-table-called for example. – Bill Weiner Feb 27 '21 at 17:19
  • I tried running your suggestion but I am getting an error as `Invalid input syntax for integer : "g"`. Any idea why it could be and how I can resolve it? Looks like this is different between redshift and postgresql @SteveChambers – AndyP Feb 28 '21 at 19:21
  • Good spot - Redshift's `REGEXP_REPLACE` function doesn't have the flag parameter and it looks like all replacements are made by default without needing the `g` global flag. I've now simplified the SQL to use the `REPLACE` function instead - as from your examples it could just trim the spaces rather than the more general case of whitespace. – Steve Chambers Mar 01 '21 at 09:01
  • That makes sense. After replacing those, I am now getting a different error: `UNSUPPORTED_FEATURE :: Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;`. I am not sure what it means. Also, in your example you have `tbl` as a table, but in my case that's a CTE with the required columns as mentioned in my question, so I replaced `tbl` with my CTE name. Any idea what is wrong here? @SteveChambers – AndyP Mar 01 '21 at 17:16
  • I am not sure what the issue is but found something [here](https://docs.aws.amazon.com/redshift/latest/dg/c_sql-functions-leader-node.html). Is this related to what @BillWeiner said above? – AndyP Mar 01 '21 at 17:21
  • Any idea on how we can fix this error? @SteveChambers – AndyP Mar 01 '21 at 21:42
  • Yes, this is the error message you get when running a leader-node-only function in combination with compute node data. The answers I posted above have approaches to generate a series of numbers w/o the generate_series() function. Since you are looking to make the numbers between 1 and 100, the easiest approach is to make a CTE with the numbers 0-9, cross join it with itself, use each source as a digit (tens and ones) and then add 1 (to get 1-100 instead of 0-99). I can code this out if you need. – Bill Weiner Mar 01 '21 at 23:15
  • @BillWeiner sure. Really appreciate your help! – AndyP Mar 01 '21 at 23:55
  • I've adapted the answer to generate a numbers table from one of the links provided by @BillWeiner. Hopefully it will now live up to the promise of being compatible with Redshift but please post again if it doesn't :-) – Steve Chambers Mar 02 '21 at 08:58
  • 1
    Perfect @SteveChambers. Thanks for coding it out. – Bill Weiner Mar 02 '21 at 15:05
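The digit cross-join approach described in the comments can be sketched in Python as a quick sanity check (the SQL answer above extends the same idea to three digits for 1–1000):

```python
# Cross join a 0-9 digits set with itself, treat one copy as the ones
# digit and the other as the tens digit, then add 1 to shift the range
# from 0-99 to 1-100.
digits = range(10)
numbers = sorted(ones + tens * 10 + 1 for ones in digits for tens in digits)
```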

I have built an example script, starting from this TSV

client_id   day counts  process_ids
abc1    Feb-01-2021 3   C1,C2 | C3,C4,C5 | C6,C7
abc2    Feb-05-2021 2,3 C10,C11,C12 | C13,C14 # C15,C16 | C17,C18

This is the pretty-printed version

+-----------+-------------+--------+-------------------------------------------+
| client_id | day         | counts | process_ids                               |
+-----------+-------------+--------+-------------------------------------------+
| abc1      | Feb-01-2021 | 3      | C1,C2 | C3,C4,C5 | C6,C7                  |
| abc2      | Feb-05-2021 | 2,3    | C10,C11,C12 | C13,C14 # C15,C16 | C17,C18 |
+-----------+-------------+--------+-------------------------------------------+

I have written this Miller procedure

mlr --tsv clean-whitespace then put -S '
  if ($process_ids=~"|" && $counts=~"^[0-9]$")
    {$process_ids=gsub($process_ids," *[|] *",",")}
  elif($process_ids=~"[#]")
    {$process_ids=gsub(gsub($process_ids," *[|] *",",")," *# *","#");$counts=gsub($counts,",","#")}'  then \
put '
  asplits = splitnv($counts, "#");
  bsplits = splitnv($process_ids, "#");
  n = length(asplits);
  for (int i = 1; i <= n; i += 1) {
    outrec = $*;
    outrec["counts"] = asplits[i];
    outrec["process_ids"] = bsplits[i];
    emit outrec;
  }
' then \
uniq -a then \
filter -x -S '$counts=~"[#]"' then \
cat -n then \
nest --explode --values --across-records -f process_ids --nested-fs "," then \
cut -x -f n input.tsv

that gives you

client_id       day     counts  process_ids
abc1    Feb-01-2021     3       C1
abc1    Feb-01-2021     3       C2
abc1    Feb-01-2021     3       C3
abc1    Feb-01-2021     3       C4
abc1    Feb-01-2021     3       C5
abc1    Feb-01-2021     3       C6
abc1    Feb-01-2021     3       C7
abc2    Feb-05-2021     2       C10
abc2    Feb-05-2021     2       C11
abc2    Feb-05-2021     2       C12
abc2    Feb-05-2021     2       C13
abc2    Feb-05-2021     2       C14
abc2    Feb-05-2021     3       C15
abc2    Feb-05-2021     3       C16
abc2    Feb-05-2021     3       C17
abc2    Feb-05-2021     3       C18
aborruso