
The question of how to split a field (e.g. a CSV string) into multiple rows has already been answered: Split values over multiple rows.

However, this question refers to MSSQL, and the answers use various features for which there are no RedShift equivalents.

For the sake of completeness, here's an example of what I'd like to do:

Current data:

| Key | Data     |
+-----+----------+
| 1   | 18,20,22 |
| 2   | 17,19    |

Required data:

| Key | Data     |
+-----+----------+
| 1   | 18       |
| 1   | 20       |
| 1   | 22       |
| 2   | 17       |
| 2   | 19       |

Now, I can suggest a workaround for the case of a small, bounded number of elements in the CSV field: use split_part and UNION over all possible positions, like so:

SELECT Key, split_part(Data, ',', 1)
FROM mytable
WHERE split_part(Data, ',', 1) <> ''
    UNION
SELECT Key, split_part(Data, ',', 2)
FROM mytable
WHERE split_part(Data, ',', 2) <> ''
-- etc. etc.

However, this is obviously very inefficient, and would not work for longer lists.

Any better ideas on how to do this?

EDIT:

There's also a somewhat similar question about multiplying rows: splitting rows in Redshift. However, I don't see how that approach can be applied here.

EDIT 2:

A possible duplicate: Redshift. Convert comma delimited values into rows. However, there's nothing new there: the answer by @Masashi Miyazaki is similar to my suggestion above and suffers from the same issues.

etov
  • See http://stackoverflow.com/questions/25112389/redshift-convert-comma-delimited-values-into-rows/31998832#31998832 – Bob Baxley Sep 25 '15 at 18:17
  • Please see a viable answer for this question here. [https://stackoverflow.com/questions/46784721/redshift-split-single-dynamic-column-into-multiple-rows-in-new-table/46785509#46785509](https://stackoverflow.com/questions/46784721/redshift-split-single-dynamic-column-into-multiple-rows-in-new-table/46785509#46785509) – Jon Scott Oct 17 '17 at 09:43
  • @JonScott, this looks good. Will you frame it as an answer so that I can resolve? – etov Oct 17 '17 at 11:55
  • @etov I have added that as the answer tailored to your question – Jon Scott Oct 17 '17 at 12:59

2 Answers


Here is the Redshift answer; it will work with up to 10,000 values per row.

Set up test data

create table test_data (key varchar(50), data varchar(max));
insert into test_data
    values
      ('1','18,20,22'),
      ('2','17,19')
;

Code

with ten_numbers as (select 1 as num union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 0)
  , generated_numbers AS
(
    SELECT (1000 * t1.num) + (100 * t2.num) + (10 * t3.num) + t4.num AS gen_num
    FROM ten_numbers AS t1
      JOIN ten_numbers AS t2 ON 1 = 1
      JOIN ten_numbers AS t3 ON 1 = 1
      JOIN ten_numbers AS t4 ON 1 = 1
)
  , splitter AS
(
    SELECT *
    FROM generated_numbers
    WHERE gen_num BETWEEN 1 AND (SELECT max(REGEXP_COUNT(data, '\\,') + 1)
                                 FROM test_data)
)
  , expanded_input AS
(
    SELECT
      key,
      split_part(data, ',', s.gen_num) AS data
    FROM test_data AS td
      JOIN splitter AS s ON 1 = 1
    WHERE split_part(data, ',', s.gen_num) <> ''
)
SELECT * FROM expanded_input
order by key,data;
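
As a side note, on clusters recent enough to support recursive CTEs (added to Redshift in 2021), the hand-built number generator above can be replaced with a more compact sketch along the same lines. This is an assumption-laden variant, not part of the original answer; it relies on the same split_part-by-position technique:

WITH RECURSIVE numbers (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM numbers WHERE n < 10000
)
SELECT
  key,
  split_part(data, ',', n) AS data
FROM test_data
JOIN numbers ON n <= REGEXP_COUNT(data, ',') + 1
ORDER BY key, data;

Joining on n <= REGEXP_COUNT(data, ',') + 1 generates exactly one row per element, so the empty-string filter is no longer needed.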
Jon Scott
  • I have similar question [here](https://stackoverflow.com/questions/66359957/how-to-split-two-columns-basis-on-what-format-they-are-following-in-redshift-dat) related to split. I wanted to see if you can help me out? – AndyP Feb 25 '21 at 05:03

You're welcome to take an RDS PostgreSQL instance and create a dblink to Redshift. Then you can manipulate the result set as on a normal PostgreSQL DB, and even put the results back into Redshift through the same dblink.
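
As a rough sketch of this idea, run from the RDS PostgreSQL side (connection details below are placeholders; regexp_split_to_table is a PostgreSQL function that does not exist in Redshift):

CREATE EXTENSION IF NOT EXISTS dblink;

-- Pull the rows over from Redshift, then split them locally in PostgreSQL.
SELECT t.key, regexp_split_to_table(t.data, ',') AS data
FROM dblink(
       'host=my-cluster.example.redshift.amazonaws.com port=5439 dbname=mydb user=myuser password=...',
       'SELECT key, data FROM mytable'
     ) AS t(key varchar, data varchar);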

Yuri Levinsky
  • Array types are not supported by redshift: http://docs.aws.amazon.com/redshift/latest/dg/c_unsupported-postgresql-datatypes.html, so this isn't a valid answer... – etov Apr 28 '15 at 15:12
  • Would have been extremely happy to, however this is what I get: "[Amazon](500310) Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.; Warnings: Function "string_to_array(text,text)" not supported" (query: select string_to_array(csv_field, ',') from mytable;) – etov Apr 29 '15 at 07:13
  • Cluster version is 1.0.901, and version upgrade is allowed... Do you have any documentation relating to this function (e.g. in RedShift's release notes?) – etov Apr 30 '15 at 11:02
  • Sorry, my mistake. The function that I proposed can work on RedShift only in the way that I proposed, and can't be used on table data as requested. The only way to do it on table data is: take an RDS PostgreSQL instance and create a dblink to RedShift. Then you can manipulate the result set as on a normal PostgreSQL DB and even put the results back into RedShift through the same dblink. – Yuri Levinsky Apr 30 '15 at 15:55
  • Thanks @Yuri. This is indeed a possible workaround, however it defeats the original purpose of using Redshift (e.g. fast querying on large amounts of data). In this respect, the approach suggested in the question might be better when possible. Seems like we'll have to wait until Amazon adds this... – etov May 03 '15 at 17:27