20

Here is the DDL --

create table tbl1 (
   id number,
   value varchar2(50)
);

insert into tbl1 values (1, 'AA, UT, BT, SK, SX');
insert into tbl1 values (2, 'AA, UT, SX');
insert into tbl1 values (3, 'UT, SK, SX, ZF');

Notice that here value is a comma-separated string.

But we need a result like the following:

ID VALUE
-------------
1  AA
1  UT
1  BT
1  SK
1  SX
2  AA
2  UT
2  SX
3  UT
3  SK
3  SX
3  ZF

How do we write SQL for this?

Samiul Al Hossaini
  • The design itself is already flawed in my eyes. Storing multiple values in one column already defeats normalization and shouldn't be done. I'd rather split them before the insert (or don't use a single value to begin with) and use a loop to insert the single values. – SomeJavaGuy Jul 14 '16 at 10:39
    Possible duplicate of [oracle -- Split multiple comma separated values in oracle table to multiple rows](http://stackoverflow.com/questions/18770581/oracle-split-multiple-comma-separated-values-in-oracle-table-to-multiple-rows) – Aleksej Jul 14 '16 at 10:47

6 Answers

31

I agree that this is a really bad design. Try this if you can't change that design:

select distinct id, trim(regexp_substr(value,'[^,]+', 1, level) ) value, level
  from tbl1
   connect by regexp_substr(value, '[^,]+', 1, level) is not null
   order by id, level;

OUTPUT

id value level
1   AA  1
1   UT  2
1   BT  3
1   SK  4
1   SX  5
2   AA  1
2   UT  2
2   SX  3
3   UT  1
3   SK  2
3   SX  3
3   ZF  4
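The mechanics are easier to see outside SQL. Here is a rough Python model of `regexp_substr(value, '[^,]+', 1, level)` plus `trim` as the `CONNECT BY` loop applies them (the helper names are made up for illustration; this is not Oracle code):

```python
import re

def regexp_substr(value, pattern, position, occurrence):
    """Rough model of Oracle's REGEXP_SUBSTR: return the n-th match of
    pattern starting at 1-based position, or None (SQL NULL)."""
    matches = re.findall(pattern, value[position - 1:])
    return matches[occurrence - 1] if occurrence <= len(matches) else None

def split_row(row_id, value):
    """Mimic the CONNECT BY loop: keep raising level until
    REGEXP_SUBSTR returns NULL, trimming each token."""
    level = 1
    while (token := regexp_substr(value, '[^,]+', 1, level)) is not None:
        yield row_id, token.strip(), level  # TRIM drops the space after each comma
        level += 1

print(list(split_row(2, 'AA, UT, SX')))
# [(2, 'AA', 1), (2, 'UT', 2), (2, 'SX', 3)]
```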

Credits to this

To remove duplicates in a more elegant and efficient way (credits to @mathguy)

select id, trim(regexp_substr(value,'[^,]+', 1, level) ) value, level
  from tbl1
   connect by regexp_substr(value, '[^,]+', 1, level) is not null
      and PRIOR id =  id 
      and PRIOR SYS_GUID() is not null  
   order by id, level;

If you want an "ANSIer" approach, go with a CTE:

with t (id,res,val,lev) as (
           select id, trim(regexp_substr(value,'[^,]+', 1, 1 )) res, value as val, 1 as lev
             from tbl1
            where regexp_substr(value, '[^,]+', 1, 1) is not null
            union all           
            select id, trim(regexp_substr(val,'[^,]+', 1, lev+1) ) res, val, lev+1 as lev
              from t
              where regexp_substr(val, '[^,]+', 1, lev+1) is not null
              )
select id, res,lev
  from t
order by id, lev;

OUTPUT

id  res lev
1   AA  1
1   UT  2
1   BT  3
1   SK  4
1   SX  5
2   AA  1
2   UT  2
2   SX  3
3   UT  1
3   SK  2
3   SX  3
3   ZF  4

Another recursive approach by MT0 but without regex:

WITH t ( id, value, start_pos, end_pos ) AS
  ( SELECT id, value, 1, INSTR( value, ',' ) FROM tbl1
  UNION ALL
  SELECT id,
    value,
    end_pos                    + 1,
    INSTR( value, ',', end_pos + 1 )
  FROM t
  WHERE end_pos > 0
  )
SELECT id,
  SUBSTR( value, start_pos, DECODE( end_pos, 0, LENGTH( value ) + 1, end_pos ) - start_pos ) AS value
FROM t
ORDER BY id,
  start_pos;
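The `start_pos`/`end_pos` arithmetic of the recursive branch can be sketched in Python (illustration only; note that, like the SQL above, it does not trim the space after each comma):

```python
def instr(s, sub, start=1):
    """1-based INSTR: position of sub in s at/after start, 0 if absent."""
    return s.find(sub, start - 1) + 1

def split_no_regex(value):
    """Walk (start_pos, end_pos) pairs exactly as the recursive branch does."""
    start_pos, end_pos = 1, instr(value, ',')
    out = []
    while True:
        # DECODE(end_pos, 0, LENGTH(value) + 1, end_pos) - start_pos
        length = (len(value) + 1 if end_pos == 0 else end_pos) - start_pos
        out.append(value[start_pos - 1:start_pos - 1 + length])
        if end_pos == 0:         # WHERE end_pos > 0 stops the recursion
            return out
        start_pos = end_pos + 1
        end_pos = instr(value, ',', start_pos)

print(split_no_regex('UT, SK, SX, ZF'))
# ['UT', ' SK', ' SX', ' ZF']
```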

I've tried these approaches with a 30,000-row dataset (118,104 rows returned) and got the following average results:

  • My recursive approach: 5 seconds
  • MT0 approach: 4 seconds
  • Mathguy approach: 16 seconds
  • MT0 recursive approach no-regex: 3.45 seconds

@Mathguy has also tested with a bigger dataset:

In all cases the recursive query (I only tested the one with regular substr and instr) does better, by a factor of 2 to 5. Here are the combinations of # of strings / tokens per string and CTAS execution times for hierarchical vs. recursive, hierarchical first. All times are in seconds.

  • 30,000 x 4: 5 / 1.
  • 30,000 x 10: 15 / 3.
  • 30,000 x 25: 56 / 37.
  • 5,000 x 50: 33 / 14.
  • 5,000 x 100: 160 / 81.
  • 10,000 x 200: 1,924 / 772
vercelli
    See this, for example, for how to avoid duplicates (so you don't have to `select DISTINCT`): https://community.oracle.com/thread/2526535 –  Jul 14 '16 at 11:47
  • Great insight @mathguy. Thanks. – vercelli Jul 14 '16 at 11:54
  • Using `CONNECT BY` like this will cross-connect items with different `id`s and create an exponentially growing list of duplicates which you have to remove with `DISTINCT` and it becomes increasing inefficient as the number of rows and elements per row grows. – MT0 Jul 14 '16 at 11:54
    @vercelli Just be aware that it is a hack. Using a correlated table collection expression (as per [my answer](http://stackoverflow.com/a/38373452/1509264) or [this answer](http://stackoverflow.com/a/29255275/1509264)) does not generate duplicates and does not need this trickery to prevent cyclic connections in the data. – MT0 Jul 14 '16 at 12:08
  • @Vercelli - I (now) disagree with MT0 on this. There is no "hack" in using `SYS_GUID()` the way I did. Please see comments to my Answer. –  Jul 14 '16 at 12:26
  • @mathguy, MT0: I made a standard recursive approach. No hacks involved. I'll try to test it with a higher number of records and see how it performs. – vercelli Jul 14 '16 at 14:05
  • Hierarchical queries are often faster than recursive ones. Testing is absolutely the best way to decide. The recursive approach has the advantage that, with a little more effort, you can use standard SUBSTR and INSTR and avoid regular expressions, which are sometimes (often?) more CPU-intensive and may slow you down just a bit. –  Jul 14 '16 at 14:07
  • @mathguy I tested with a bigger dataset and posted the results. It wasn't a very thorough test, but I guess it's interesting – vercelli Jul 14 '16 at 14:34
  • @vercelli What are the `"` units on the numbers and is bigger or smaller better? – MT0 Jul 14 '16 at 15:02
  • @MT0 " means seconds. The lesser the better – vercelli Jul 14 '16 at 15:04
  • A recursive version without regular expressions: `WITH t ( id, value, start_pos, end_pos ) AS ( SELECT id, value, 1, INSTR( value, ',' ) FROM tbl1 UNION ALL SELECT id, value, end_pos + 1, INSTR( value, ',', end_pos + 1 ) FROM t WHERE end_pos > 0 ) SELECT id, SUBSTR( value, start_pos, DECODE( end_pos, 0, LENGTH( value ) + 1, end_pos ) - start_pos ) AS value FROM t ORDER BY id, start_pos;` – MT0 Jul 14 '16 at 15:19
  • @MT0 great, that's slightly faster. – vercelli Jul 14 '16 at 15:23
    @Vercelli - That difference in performance would be a much better reason to prefer the "collections" route (MT0) and your "recursive" route. I am going to test more - I suspect with longer strings (with more tokens per input string) performance may be different. If there is interest, I should probably open a separate question/thread to post test results, etc. For right now, I just tested the hierarchical solution ("my" solution) with 30000 input strings and 4 tokens per string, total output 120000 rows; on my laptop and with the free version of Oracle 11.2, it takes 5.1 seconds to run. –  Jul 14 '16 at 17:40
    OK, I just tested with different scenarios. In all cases the recursive query (I only tested the one with regular `substr` and `instr`) does better, by a factor of 2 to 5. Here are the combinations of # of strings / tokens per string and CTAS execution times for hierarchical vs. recursive, hierarchical first. All times in seconds. 30,000 x 4: 5 / 1. 30,000 x 10: 15 / 3. 30,000 x 25: 56 / 37. 5,000 x 50: 33 / 14. 5,000 x 100: 160 / 81. 10,000 x 200: 1,924 / 772 –  Jul 14 '16 at 19:26
  • Will it work with all DBMSs, like MySQL, PostgreSQL, MariaDB, etc.? – Raky Mar 31 '21 at 02:56
    @Raky it depends. Some of the solutions here are recursive and old versions of MySQL don't support WITH. Redshift doesn't support recursion. Most of the regex will have to be changed, too. So, I'll say that if your DBMS supports recursion go with "Another recursive approach by MT0 but without regex:" solution – vercelli Mar 31 '21 at 11:45
7

This will get the values without requiring you to remove duplicates or having to use a hack of including SYS_GUID() or DBMS_RANDOM.VALUE() in the CONNECT BY:

SELECT t.id,
       v.COLUMN_VALUE AS value
FROM   TBL1 t,
       TABLE(
         CAST(
           MULTISET(
             SELECT TRIM( REGEXP_SUBSTR( t.value, '[^,]+', 1, LEVEL ) )
             FROM   DUAL
             CONNECT BY LEVEL <= REGEXP_COUNT( t.value, '[^,]+' )
           )
           AS SYS.ODCIVARCHAR2LIST
         )
       ) v
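Conceptually, the correlated `TABLE(MULTISET(...))` expands each row against its own token list, so cross-row duplicates cannot arise in the first place. A Python sketch of that per-row flattening (illustration, not Oracle code):

```python
import re

rows = [(1, 'AA, UT, BT, SK, SX'), (2, 'AA, UT, SX'), (3, 'UT, SK, SX, ZF')]

# Each row is expanded against its OWN token list, the way the correlated
# TABLE(MULTISET(...)) references t.value, so no cross-join duplicates appear.
result = [(row_id, token.strip())
          for row_id, value in rows
          for token in re.findall('[^,]+', value)]

print(result[:5])
# [(1, 'AA'), (1, 'UT'), (1, 'BT'), (1, 'SK'), (1, 'SX')]
```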

Update:

Returning the index of the element in the list:

Option 1 - Return a UDT:

CREATE TYPE string_pair IS OBJECT( lvl INT, value VARCHAR2(4000) );
/

CREATE TYPE string_pair_table IS TABLE OF string_pair;
/

SELECT t.id,
       v.*
FROM   TBL1 t,
       TABLE(
         CAST(
           MULTISET(
             SELECT string_pair( level, TRIM( REGEXP_SUBSTR( t.value, '[^,]+', 1, LEVEL ) ) )
             FROM   DUAL
             CONNECT BY LEVEL <= REGEXP_COUNT( t.value, '[^,]+' )
           )
           AS string_pair_table
         )
       ) v;

Option 2 - Use ROW_NUMBER():

SELECT t.id,
       v.COLUMN_VALUE AS value,
       ROW_NUMBER() OVER ( PARTITION BY id ORDER BY ROWNUM ) AS lvl
FROM   TBL1 t,
       TABLE(
         CAST(
           MULTISET(
             SELECT TRIM( REGEXP_SUBSTR( t.value, '[^,]+', 1, LEVEL ) )
             FROM   DUAL
             CONNECT BY LEVEL <= REGEXP_COUNT( t.value, '[^,]+' )
           )
           AS SYS.ODCIVARCHAR2LIST
         )
       ) v;
MT0
  • I disagree; `CONNECT BY` with no condition using the `PRIOR` operator, as you do, is a hack (it violates the Oracle requirements for hierarchical queries). I don't see what the hack is when using `PRIOR SYS_GUID()` to break cycles; that is a perfectly legitimate use of hierarchical queries. –  Jul 14 '16 at 12:28
    @mathguy Tom Kyte has stated that this is a [bug with the documentation](https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:40476301944675#51117897873960) and that "No, you have never needed `PRIOR` in a connect by." ([link](https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:40476301944675#51097757862313)). – MT0 Jul 14 '16 at 12:46
  • Just reading about `cast(multiset(....))` - if I remember correctly (from reading about this a couple months ago), a nested table in Oracle can only have one column, right? Or is it possible to extract not just the tokens, but also their "level" within the original string? In some cases, presumably the order of the tokens has meaning that may need to be used in further processing. –  Jul 14 '16 at 17:37
  • There are multiple ways to solve that: You can return a UDT as a table of an object type with multiple fields; or you can use the `ROW_NUMBER()` analytic function. Just 2 that immediately spring to mind. – MT0 Jul 14 '16 at 18:21
  • @MT0 - I'll spend some time on both. In any case, contrary to what I was led to believe, the recursive CTE approach (which I always liked better than the hierarchical approach) seems to be faster, and not just by a small margin. I'll go back to the older thread with a similar question (from a few days ago), the OP there asked if we can help improve performance since his actual data is much "bigger" than the sample he posted. I'll try the recursive query on that exercise, and the `cast(multiset(...))` approach as well if I can figure it out (which I believe you posted already). Thanks! –  Jul 14 '16 at 19:32
  • @mathguy Updated with both methods. – MT0 Jul 14 '16 at 20:06
  • Got that, thanks. I suppose ROWNUM can also be used directly (no difference vs. using it for ordering in ROW_NUMBER()). –  Jul 14 '16 at 20:12
  • @mathguy Not easily - `ROW_NUMBER()` works as you can partition by the id (assuming they are unique - or `ROWID` generated in an inline view if they aren't). `ROWNUM` you need to put the table collection expression in an inline view to generate it starting from 1 for each original row but then the table collection expression is nested too deeply to be referenced and doesn't work. – MT0 Jul 14 '16 at 20:16
5

Vercelli posted a correct answer. However, with more than one string to split, connect by will generate an exponentially-growing number of rows, with many, many duplicates. (Just try the query without distinct.) This will destroy performance on data of non-trivial size.

One common way to overcome this problem is to use a prior condition and an additional check to avoid cycles in the hierarchy. Like so:

select id, trim(regexp_substr(value,'[^,]+', 1, level) ) value, level
  from tbl1
   connect by regexp_substr(value, '[^,]+', 1, level) is not null
          and prior id = id
          and prior sys_guid() is not null
   order by id, level;

See, for example, this discussion on OTN: https://community.oracle.com/thread/2526535

  • I know we've discussed this before but using `SYS_GUID()` is a hack and I believe it is better to use a correlated table collection expression which will never generate these duplicates and so you do not have to resort to work-arounds to deal with them. – MT0 Jul 14 '16 at 11:57
  • We did discuss it before. I know nothing (I only started learning SQL and Oracle this February), but I find that all the gurus on OTN, Tom Kyte, etc. all use the `sys_guid()` or `dbms_random.value()` trick. See the link I provided. Note that `connect by` without a condition using the `PRIOR` operator is already a hack (it violates the Oracle requirements for `CONNECT BY` - see the documentation: https://docs.oracle.com/cd/B28359_01/server.111/b28286/queries003.htm , see the second bullet point after the syntax diagram). –  Jul 14 '16 at 12:04
  • @MT0 - actually, I changed my mind. `CONNECT BY` without a condition using `PRIOR` is a hack and may not be supported in the future. The way I use it **is not a hack** because I do use the `PRIOR` operator in at least one condition. `SYS_GUID()` is guaranteed to produce different values for each row, which then results in there not being cycles in the hierarchy. I don't agree that this is a hack. Why is it a hack? –  Jul 14 '16 at 12:24
  • It is a hack because `AND PRIOR SYS_GUID() IS NOT NULL` will always be true so the condition boils down to `AND TRUE` and should be irrelevant - however, removing it you will get `ORA-01436: CONNECT BY loop in user data`. – MT0 Jul 14 '16 at 12:34
  • That is incorrect. `AND PRIOR SYS_GUID() IS NOT NULL` does **two** things, not one. It evaluates to TRUE in all cases, but it also adds a unique bit of data to each newly generated row. `AND TRUE` only does the first job, not the second. –  Jul 14 '16 at 12:45
3

An alternate method is to define a simple PL/SQL function:

CREATE OR REPLACE FUNCTION split_String(
  i_str    IN  VARCHAR2,
  i_delim  IN  VARCHAR2 DEFAULT ','
) RETURN SYS.ODCIVARCHAR2LIST DETERMINISTIC
AS
  p_result       SYS.ODCIVARCHAR2LIST := SYS.ODCIVARCHAR2LIST();
  p_start        NUMBER(5) := 1;
  p_end          NUMBER(5);
  c_len CONSTANT NUMBER(5) := LENGTH( i_str );
  c_ld  CONSTANT NUMBER(5) := LENGTH( i_delim );
BEGIN
  IF c_len > 0 THEN
    p_end := INSTR( i_str, i_delim, p_start );
    WHILE p_end > 0 LOOP
      p_result.EXTEND;
      p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, p_end - p_start );
      p_start := p_end + c_ld;
      p_end := INSTR( i_str, i_delim, p_start );
    END LOOP;
    IF p_start <= c_len + 1 THEN
      p_result.EXTEND;
      p_result( p_result.COUNT ) := SUBSTR( i_str, p_start, c_len - p_start + 1 );
    END IF;
  END IF;
  RETURN p_result;
END;
/
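For illustration, the loop logic of `split_String` ports almost line for line to Python (hypothetical function name; like the PL/SQL, it does not trim tokens, and a trailing delimiter yields an empty last element):

```python
def split_string(i_str, i_delim=','):
    """Port of the PL/SQL loop: repeated INSTR/SUBSTR over the string,
    then one final element after the last delimiter."""
    result = []
    if not i_str:                    # IF c_len > 0 guard
        return result
    start = 0
    while True:
        end = i_str.find(i_delim, start)
        if end < 0:
            break
        result.append(i_str[start:end])  # SUBSTR(i_str, p_start, p_end - p_start)
        start = end + len(i_delim)       # p_start := p_end + c_ld
    if start <= len(i_str):              # IF p_start <= c_len + 1 (trailing element)
        result.append(i_str[start:])
    return result

print(split_string('AA, UT, SX'))
# ['AA', ' UT', ' SX']
```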

Then the SQL becomes very simple:

SELECT t.id,
       v.column_value AS value
FROM   TBL1 t,
       TABLE( split_String( t.value ) ) v
MT0
1
-- converting rows of data into a comma-separated string
SELECT
    department_id,
    LISTAGG(first_name, ',') WITHIN GROUP(
        ORDER BY
            first_name
    ) comma_separated_data
FROM
    hr.employees
GROUP BY
    department_id;
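A Python sketch of what `LISTAGG(first_name, ',') WITHIN GROUP (ORDER BY first_name) ... GROUP BY department_id` computes (sample data made up for illustration):

```python
from collections import defaultdict

employees = [(10, 'sam'), (10, 'mohan'), (20, 'ashok'), (10, 'john')]

# GROUP BY department_id, then LISTAGG(first_name, ',')
# WITHIN GROUP (ORDER BY first_name) per group
groups = defaultdict(list)
for dept, name in employees:
    groups[dept].append(name)
aggregated = {dept: ','.join(sorted(names)) for dept, names in groups.items()}

print(aggregated)
# {10: 'john,mohan,sam', 20: 'ashok'}
```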

-- comma-separated string into rows of data

CREATE TABLE t (
    deptno          NUMBER,
    employee_name   VARCHAR2(255)
);

INSERT INTO t VALUES (
    10,
    'mohan,sam,john'
);

INSERT INTO t VALUES (
    20,
    'manideeep,ashok,uma'
);

INSERT INTO t VALUES (
    30,
    'gopal,gopi,manoj'
);

SELECT
    deptno,
    l,
    regexp_substr(employee_name, '[^,]+', 1, l) employee
FROM
    t,
    LATERAL (
        SELECT
            level l
        FROM
            dual
        CONNECT BY
            level <= regexp_count(employee_name, ',') + 1
    );

DROP TABLE t;
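The LATERAL inline view generates one index per token of that particular row, so each row expands by exactly its own token count. A Python model of that expansion, where the per-row index l selects the l-th token (illustration only):

```python
import re

t = [(10, 'mohan,sam,john'), (20, 'manideeep,ashok,uma')]

# LATERAL (CONNECT BY LEVEL <= regexp_count(...) + 1) yields l = 1..token_count
# for THAT row; regexp_substr(..., l) then picks the l-th token.
expanded = [(dept, re.findall('[^,]+', names)[l - 1])
            for dept, names in t
            for l in range(1, names.count(',') + 2)]

print(expanded)
# [(10, 'mohan'), (10, 'sam'), (10, 'john'),
#  (20, 'manideeep'), (20, 'ashok'), (20, 'uma')]
```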
0
SELECT  COL1,   COL2
FROM    (   SELECT INDX, MY_STR1, MY_STR2, COL1_ELEMENTS, COL1, COL2_ELEMENTS, COL2
            FROM    (   SELECT 0 "INDX", COL1 "MY_STR1", COL1_ELEMENTS, COL1, '' "MY_STR2", COL2_ELEMENTS, COL2
                        FROM(
                                SELECT
                                    REPLACE(COL1, ', ', ',') "COL1",    -- In case there is a space after comma
                                    Trim(Length(Replace(COL1, ' ', ''))) - Trim(Length(Translate(REPLACE(COL1, ', ', ','), 'A,', 'A'))) + 1 "COL1_ELEMENTS",    -- Number of elements
                                    Replace(COL2, ', ', ',') "COL2",    -- In case there is a space after comma
                                    Trim(Length(Replace(COL2, ' ', ''))) - Trim(Length(Translate(REPLACE(COL2, ', ', ','), 'A,', 'A'))) + 1 "COL2_ELEMENTS"     -- Number of elements
                                FROM
                                    (SELECT 'aaa,bbb,ccc' "COL1", 'qq, ww, ee' "COL2" FROM DUAL)        -- Your example data
                            )
                    )
                MODEL       -- Modeling --> INDX = 0    COL1='aaa,bbb,ccc'      COL2='qq,ww,ee'
                    DIMENSION BY(0 as INDX)
                    MEASURES(COL1, COL1_ELEMENTS, COL2, CAST('a' as VarChar2(4000)) as MY_STR1, CAST('a' as VarChar2(4000)) as MY_STR2)
                    RULES ITERATE (10)      --UNTIL (ITERATION_NUMBER <= COL1_ELEMENTS[ITERATION_NUMBER + 1]) -- If you don't know the number of elements, this should be bigger than your approximation. Otherwise it will split the given number of elements
                    (
                        COL1_ELEMENTS[ITERATION_NUMBER + 1] = COL1_ELEMENTS[0],
                        MY_STR1[0] = COL1[CV()],
                        MY_STR1[ITERATION_NUMBER + 1] = SubStr(MY_STR1[ITERATION_NUMBER], InStr(MY_STR1[ITERATION_NUMBER], ',', 1) + 1),
                        COL1[ITERATION_NUMBER + 1] = SubStr(MY_STR1[ITERATION_NUMBER], 1, CASE WHEN InStr(MY_STR1[ITERATION_NUMBER], ',') <> 0 THEN InStr(MY_STR1[ITERATION_NUMBER], ',')-1 ELSE Length(MY_STR1[ITERATION_NUMBER]) END),
                        MY_STR2[0] = COL2[CV()],
                        MY_STR2[ITERATION_NUMBER + 1] = SubStr(MY_STR2[ITERATION_NUMBER], InStr(MY_STR2[ITERATION_NUMBER], ',', 1) + 1),
                        COL2[ITERATION_NUMBER + 1] = SubStr(MY_STR2[ITERATION_NUMBER], 1, CASE WHEN InStr(MY_STR2[ITERATION_NUMBER], ',') <> 0 THEN InStr(MY_STR2[ITERATION_NUMBER], ',')-1 ELSE Length(MY_STR2[ITERATION_NUMBER]) END)
                    )
        )
WHERE INDX > 0 And INDX <= COL1_ELEMENTS    -- INDX 0 contains starting strings
--
--  COL1  COL2
--  ----  ----
--  aaa   qq
--  bbb   ww
--  ccc   ee
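The net effect of the MODEL rules is to peel one element per iteration from both strings in lockstep; in Python the same row pairing falls out of zipping the two split lists (illustration only):

```python
col1, col2 = 'aaa,bbb,ccc', 'qq, ww, ee'

# REPLACE(col, ', ', ',') normalizes the space after commas, as in the query;
# zipping the two token lists reproduces the MODEL clause's lockstep pairing.
pairs = list(zip(col1.replace(', ', ',').split(','),
                 col2.replace(', ', ',').split(',')))

print(pairs)
# [('aaa', 'qq'), ('bbb', 'ww'), ('ccc', 'ee')]
```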
d r
  • Solution for your original question using MODEL, comments are inside the code. It should work for any two comma separated strings. – d r May 05 '22 at 19:28