
I need a generic Oracle function that takes a CSV string as its first parameter and a regular expression defining the CSV separator as its second parameter, and returns a table of parsed strings, as follows:

INPUT data:

NAME    PROJECT     ERROR
108     test        string-1, string-2 ; string-3
109     test2       single string
110     test3       ab,  ,c

OUTPUT data:

NAME    PROJECT     ERROR
108     test        string-1
108     test        string-2
108     test        string-3
109     test2       single string
110     test3       ab
110     test3       NULL
110     test3       c

The separators might differ between source tables, so I'd like to be able to specify them dynamically as a regex.

How can I create a generic function out of the following code:

with temp as
(
    select 108 Name, 'test' Project, 'string-1 , string-2 ; string-3' Error  from dual
    union all
    select 109, 'test2', 'single string' from dual
)
select distinct
  t.name, t.project,
  trim(regexp_substr(t.error, '[^,;]+', 1, levels.column_value))  as error
from 
  temp t,
  table(cast(multiset(select level from dual connect by  level <= length (regexp_replace(t.error, '[^,;]+'))  + 1) as sys.OdciNumberList)) levels
order by name;

SQL Fiddle

So I'm thinking of a function that takes the following parameters and returns a table of strings

CREATE OR REPLACE FUNCTION csvstr2tab(
    p_str      IN VARCHAR2,
    p_sep_re   IN VARCHAR2   DEFAULT '\s*[,;]\s*'
)

PS I have used this answer


UPDATE: please note that I'm using the abbreviation "CSV" here only to convey that the input string contains multiple values separated by various separators. I'm dealing with free text, written by human beings, who used different separators. So the input string doesn't have to be valid CSV in my case - it's just a string of values separated by several different separators.
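To make the expected behavior concrete, here is the same split sketched with Python's `re` module (illustration only - the goal is an Oracle function; `None` stands in for NULL):

```python
import re

# Separator regex from the question; tokens are assumed not to contain
# any of the separator characters.
SEP = r'\s*[,;]\s*'

def split_free_text(s):
    # Empty tokens map to None, mirroring Oracle's NULL.
    return [t if t != '' else None for t in re.split(SEP, s)]

print(split_free_text('ab, ,c'))
# ['ab', None, 'c']
print(split_free_text('string-1 , string-2 ; string-3'))
# ['string-1', 'string-2', 'string-3']
```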

MaxU - stand with Ukraine
  • The question in that degree of generality doesn't make sense. Here is just one example to illustrate the problem. If a single comma is the separator, then 'ab,,c' has three tokens: 'ab', NULL and 'c'. (You did consider NULL, didn't you?). 'ab, ,c' has three tokens: 'ab', ' ' (a single space) and 'c'. If the separator is ', ' (meaning, comma + space), then 'ab, , c' has the same three tokens. But if you want something like '\s*[,;]\s*', how do you interpret 'ab, ,c'? Is the second token a NULL, or a space? **OR** are you assuming that the tokens can't contain any of the characters in the regexp? –  Oct 28 '20 at 19:39
  • @mathguy, thank you for your comment! I'd like to parse ` 'ab, , c'` as `['ab',NULL,'c']`. Well maybe I should not have used the CSV abbreviation, because I'm dealing with free text, written by human beings, who used different separators. And yes, I'm assuming that the tokens can't contain any separators in the regex. – MaxU - stand with Ukraine Oct 28 '20 at 20:43

4 Answers


Perhaps something like this. I wrote it as a PL/SQL function, as you requested, but note that if the input data resides in the database, this can be done directly in SQL.

For illustration, I called the function with the default separator.

If you are not familiar with pipelined table functions, you can read about them in the documentation.

Note also that in Oracle 12.2 (but not in 12.1) you can leave out the table( ) operator and select directly "from the function".

create type str_t as table of varchar2(4000);
/

create or replace function csvstr2tab(
  p_str    in varchar2,
  p_sep_re in varchar2 default '\s*[,;]\s*'
)
  return str_t
  pipelined
as
begin
  for i in 1 .. regexp_count(p_str, p_sep_re) + 1 loop
    pipe row (regexp_substr(p_str, '(.*?)(' || p_sep_re || '|$)', 1, i, null, 1));
  end loop;
  return;
end;
/

select *
from   table(csvstr2tab('blue  ;green,,brown;,yellow;'))
;

COLUMN_VALUE
--------------------
blue
green
[NULL]
brown
[NULL]
yellow
[NULL]
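Outside the database, the per-token logic of the loop can be sketched in Python (illustration only; `re` here plays the role of Oracle's regexp functions with the occurrence and sub-expression arguments):

```python
import re

s   = 'blue  ;green,,brown;,yellow;'
sep = r'\s*[,;]\s*'

# Token i is sub-expression 1 of the i-th match of '(.*?)(<sep>|$)'.
pattern = re.compile('(.*?)(?:' + sep + '|$)')

# The loop bound regexp_count(p_str, p_sep_re) + 1 matters: without it, the
# '$' alternative can produce one spurious empty match past the last token.
n_tokens = len(re.findall(sep, s)) + 1

tokens = [m.group(1) for m in pattern.finditer(s)][:n_tokens]
print(tokens)
# ['blue', 'green', '', 'brown', '', 'yellow', '']  ('' plays the role of NULL)
```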

One more test (note that the first row in the output has two trailing spaces, too):

select *
from   table(csvstr2tab('blue  ;green,,brown;,yellow;', ';'))
;

COLUMN_VALUE
-----------------
blue  
green,,brown
,yellow

EDIT

Here is how the function can be used to break input strings into tokens when the inputs are in a table (rows identified by an ID, for example), and keep track of token order as well.

with
  sample_data(id, str) as (
    select 1201, 'blue  ;green,,brown;,yellow;' from dual union all
    select 1202, 'tinker, tailor, soldier, ...' from dual
  )
select sd.id, sd.str, tf.ord, tf.token
from   sample_data sd,
       lateral ( select rownum as ord, column_value as token
                 from   table(csvstr2tab(sd.str))
               ) tf
order by id, ord
;

    ID STR                             ORD TOKEN   
------ ---------------------------- ------ --------
  1201 blue  ;green,,brown;,yellow;      1 blue    
  1201 blue  ;green,,brown;,yellow;      2 green   
  1201 blue  ;green,,brown;,yellow;      3         
  1201 blue  ;green,,brown;,yellow;      4 brown   
  1201 blue  ;green,,brown;,yellow;      5         
  1201 blue  ;green,,brown;,yellow;      6 yellow  
  1201 blue  ;green,,brown;,yellow;      7         
  1202 tinker, tailor, soldier, ...      1 tinker  
  1202 tinker, tailor, soldier, ...      2 tailor  
  1202 tinker, tailor, soldier, ...      3 soldier  
  1202 tinker, tailor, soldier, ...      4 ...   

Try the reproducible example below. Schema preparation:

create table prjerr as
    select 108 Name, 'test' Project, 'string-1 , string-2 ; string-3' Error  from dual
    union all
    select 109, 'test2', 'single string' from dual
/
create or replace type tokenList is table of varchar2 (32767)
/

Function implementation:

create or replace function csvstr2tab (
        str varchar2, delimiter varchar2 := '\s*[,;]\s*') return tokenList is
    pattern constant varchar2 (64) := '(.*?)(('||delimiter||')|($))';
    tokens tokenList := tokenList ();
    n constant int := regexp_count (str, delimiter) + 1;
begin
    -- iterate over a fixed token count so that an empty (NULL) token in the
    -- middle of the string does not terminate the parsing early
    for c in 1 .. n loop
        tokens.extend;
        tokens(tokens.last) := regexp_substr (str, pattern, 1, c, null, 1);
    end loop;
    return tokens;
end csvstr2tab;
/

Usage of the function and the result:

select distinct name, project, t.column_value error
from prjerr p, csvstr2tab (p.error) t 
order by name
/
      NAME PROJE ERROR           
---------- ----- ----------------
       108 test  string-1        
       108 test  string-2        
       108 test  string-3        
       109 test2 single string   

PS It was tested on version 12.2.0.1.0

0xdb
  • In my environment this solution was the fastest - thanks again! – MaxU - stand with Ukraine Nov 03 '20 at 09:47
  • @MaxU - This solution is the fastest because the table function is not pipelined. There are several important considerations re: pipelining. First, a table function that is not pipelined will consume a lot of memory if you have a lot of data (perhaps making OTHER queries much slower). Second, pipelining allows "streamed processing" - for example, if the table function is used in a join to other tables, the joining can begin right away, rather than waiting for the entire table to be generated from the table function. So, "fastest" really depends on how you are consuming the table. –  Nov 03 '20 at 16:52
  • @MaxU - and, of course, if speed is important, and the data is residing in the DB already, then the fastest way to get your desired result is to not use a function at all, but instead to do **everything** in pure SQL. I mentioned this already, in the first paragraph in my answer. –  Nov 03 '20 at 16:53
  • @mathguy, thank you for your comments! In my case there will be only one technical user, using this query (ETL), but this "trick" will be used multiple times for different columns. Because of that it's more convenient to use a function instead of pure SQL and on the other hand we want to speed up the ETL process. I agree that the pipeline approach might be a better choice for another use cases) – MaxU - stand with Ukraine Nov 03 '20 at 16:59

You can use REGEXP_INSTR to track the start and end of the regular expression matches so that, at each iteration, the regular expression does not need to restart matching from the beginning of the string.

-- collection type returned by the function
-- (the element length 4000 is an assumption; any suitable size works)
CREATE TYPE string_list AS TABLE OF VARCHAR2(4000);
/

CREATE FUNCTION regexp_split(
  value            IN VARCHAR2,
  regexp_separator IN VARCHAR2 DEFAULT ','
) RETURN string_list PIPELINED DETERMINISTIC
AS
  position      PLS_INTEGER := 1;
  next_position PLS_INTEGER;
BEGIN
  IF value IS NULL THEN
    RETURN;
  END IF;
  LOOP
    next_position := REGEXP_INSTR( value, regexp_separator, position, 1, 0 );
    IF next_position = 0 THEN
      PIPE ROW ( SUBSTR( value, position ) );
      EXIT;
    ELSE
      PIPE ROW ( SUBSTR( value, position, next_position - position ) );
      position := REGEXP_INSTR( value, regexp_separator, next_position, 1, 1 );
    END IF;
  END LOOP;
  RETURN;
END;
/

(Note: the function is declared DETERMINISTIC, since its result depends only on its arguments.)

Then, for the test data:

CREATE TABLE table_name ( NAME, PROJECT, ERROR ) AS
  SELECT 108, 'test1', 'string-1, string-2 ; string-3' FROM DUAL UNION ALL
  SELECT 109, 'test2', 'single string' FROM DUAL UNION ALL
  SELECT 110, 'test3', 'ab,  ,c' FROM DUAL UNION ALL
  SELECT 111, 'test4', '1,2,;5,,,9' FROM DUAL;

You can use the function with CROSS APPLY (or a LATERAL join) to split the string:

SELECT t.name,
       t.project,
       s.COLUMN_VALUE AS error
FROM   table_name t
       CROSS APPLY TABLE( regexp_split( error, '\s*[,;]\s*' ) ) s

Which outputs:

NAME | PROJECT | ERROR        
---: | :------ | :------------
 108 | test1   | string-1     
 108 | test1   | string-2     
 108 | test1   | string-3     
 109 | test2   | single string
 110 | test3   | ab           
 110 | test3   | null         
 110 | test3   | c            
 111 | test4   | 1            
 111 | test4   | 2            
 111 | test4   | null         
 111 | test4   | 5            
 111 | test4   | null         
 111 | test4   | null         
 111 | test4   | 9            
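The position-tracking logic can be sketched in Python (illustration only - `re.finditer` walks the separator matches in a single pass, just as the `REGEXP_INSTR` calls avoid re-matching from the beginning of the string):

```python
import re

def regexp_split(value, regexp_separator=','):
    """Yield the substrings of value between separator matches."""
    if value is None:
        return
    position = 0
    for m in re.finditer(regexp_separator, value):
        yield value[position:m.start()]   # text before this separator
        position = m.end()                # resume right after the separator
    yield value[position:]                # tail after the last separator

print(list(regexp_split('1,2,;5,,,9', r'\s*[,;]\s*')))
# ['1', '2', '', '5', '', '', '9']  ('' plays the role of NULL)
```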

db<>fiddle here

MT0
  • Nice! I wasn't aware of the fourth argument to `regexp_instr` - I found it in the documentation. Very useful. A couple of notes on the solution: If the input string is NULL, I believe the function should still pipe out a row (with a NULL token). That is easy to add to the function definition. Also, the OP in a comment indicated he may find the order of tokens to be helpful. I believe the simplest way to do that is with ROWNUM, as in my solution (the EDIT) - at the cost of making the cross join a bit more complicated. ROWNUM should work here, because the function only pipes one row at a time. –  Nov 03 '20 at 00:03
  • @mathguy There are multiple ways to return a `NULL` row for a `NULL` input; one is to use a `LEFT OUTER JOIN TABLE( regexp_split( .... ) ) ON ( 1 = 1 )` then you don't need to change the function to pipe a row. There are already examples of various joins, including that one, in the linked db<>fiddle and also examples of how to return the index of the matched sub-string. I didn't feel that I needed to overload the answer with lots of detail that was irrelevant to the text of the question the OP asked. – MT0 Nov 03 '20 at 00:39
  • The OP initially asked for a function, regardless of joins. I believe the function itself should return a null row when the input is a null string (coming from any outside source, not necessarily from a stored table). Of course, only the OP really knows what he would want in that case. –  Nov 03 '20 at 01:10
  • Either solution is simple; just add `PIPE ROW ( NULL );` before the first `RETURN` if a `NULL` row is required for a `NULL` input. – MT0 Nov 03 '20 at 02:24

If you have APEX installed in your database, there is a function named APEX_STRING.SPLIT that does exactly what you are looking for. You can pass a single character or a regular expression to split the string. There is also an overloaded version of the function, so the same call can be used to split a VARCHAR2 or a CLOB.

WITH
    test_data (NAME, PROJECT, ERROR)
    AS
        (SELECT 108, 'test', 'string-1, string-2 ; string-3' FROM DUAL
         UNION ALL
         SELECT 109, 'test2', 'single string' FROM DUAL
         UNION ALL
         SELECT 110, 'test3', 'ab,  ,c' FROM DUAL)
SELECT name,
       project,
       error,
       TRIM (s.COLUMN_VALUE) as split_value
  FROM test_data td, TABLE (apex_string.split (error, '[,;]')) s;


   NAME    PROJECT                            ERROR      SPLIT_VALUE
_______ __________ ________________________________ ________________
    108 test       string-1, string-2 ; string-3    string-1
    108 test       string-1, string-2 ; string-3    string-2
    108 test       string-1, string-2 ; string-3    string-3
    109 test2      single string                    single string
    110 test3      ab,  ,c                          ab
    110 test3      ab,  ,c
    110 test3      ab,  ,c                          c
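The TRIM in the query above matters because the pattern '[,;]' has no \s* around the separators, so tokens keep their surrounding spaces; a Python sketch of the same effect (illustration only):

```python
import re

raw = re.split(r'[,;]', 'ab,  ,c')
print(raw)
# ['ab', '  ', 'c']  -- the middle token is two spaces, not empty

# TRIM of an all-blank string is NULL in Oracle; strip() + None mimics that.
trimmed = [t.strip() or None for t in raw]
print(trimmed)
# ['ab', None, 'c']
```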
EJ Egyed
  • Thank you for your answer, but I won't have APEX available in my production environment. But I would leave this answer here for those who will come here from the search engines as they could have an APEX available... – MaxU - stand with Ukraine Nov 02 '20 at 15:17