Exclude a series of characters in regex via Oracle's REGEXP_SUBSTR

Question

I am trying to use Oracle's REGEXP_SUBSTR to select fields in a string.

Example:

this,,,is,,,an,,,example

Solution:

DECLARE
  field1 VARCHAR2(4000);
  field2 VARCHAR2(4000);
  field3 VARCHAR2(4000);
  field4 VARCHAR2(4000);
  separator VARCHAR2(300) := ',,,';
  lineToParse VARCHAR2(4000) := 'this,,,is,,,an,,,example';
BEGIN
  SELECT REGEXP_SUBSTR(lineToParse, '[^' || separator || ']+', 1, 1) AS part_1, REGEXP_SUBSTR(lineToParse, '[^' || separator || ']+', 1, 2) AS part_2, REGEXP_SUBSTR(lineToParse, '[^' || separator || ']+', 1, 3) AS part_3, REGEXP_SUBSTR(lineToParse, '[^' || separator || ']+', 1, 4) AS part_4
  INTO field1, field2, field3, field4
  FROM DUAL;
  DBMS_OUTPUT.PUT_LINE('Field 1: ' || field1);
  DBMS_OUTPUT.PUT_LINE('Field 2: ' || field2);
  DBMS_OUTPUT.PUT_LINE('Field 3: ' || field3);
  DBMS_OUTPUT.PUT_LINE('Field 4: ' || field4); 
END;

This works perfectly for the line above, generating:

Field 1: this
Field 2: is
Field 3: an
Field 4: example

However for the below line, it doesn't

this,,,is, a perfectly fine,,,new,,, line

This is because the second capture group should be: "is, a perfectly fine" but ends up being "is".

Output is:

Field 1: this
Field 2: is
Field 3:  a perfectly fine
Field 4: new

Reason is that the regex I am using:

[^,,,]+

Is capturing any of the characters that follow ^ instead of the sequence.

How can I adjust my regex so that the entire sequence is captured?

This needs to be compatible with Oracle 11g.

score 0 · Accepted Answer · answered May 22 '18 at 13:56

I don't think you can do a simple regex here. First, the character class you're using, [^,,,], isn't any different from [^,] - duplicate characters enclosed in square brackets don't make for duplicate characters in the string to be matched. Second, I don't think a negative match will work because Oracle regular expressions don't support lookaround.

You might try something like the following:

SELECT REGEXP_SUBSTR(lineToParse, '.+?($|' || separator || ')', 1, 1) AS part_1
     , REGEXP_SUBSTR(lineToParse, '.+?($|' || separator || ')', 1, 2) AS part_2
     , REGEXP_SUBSTR(lineToParse, '.+?($|' || separator || ')', 1, 3) AS part_3
     , REGEXP_SUBSTR(lineToParse, '.+?($|' || separator || ')', 1, 4) AS part_4
  INTO field1, field2, field3, field4
  FROM DUAL;

This will get everything up to either the separator or the end of the line in a non-greedy way. Now the only issue is that the values returned might include the separators; there are some ways to avoid that, the most straightforward of which is to use REPLACE(), but with Oracle 11 you can also use subexpressions with REGEXP_SUBSTR():

SELECT REGEXP_SUBSTR(lineToParse, '(.+?)($|' || separator || ')', 1, 1, 'c', 1) AS part_1
     , REGEXP_SUBSTR(lineToParse, '(.+?)($|' || separator || ')', 1, 2, 'c', 1) AS part_2
     , REGEXP_SUBSTR(lineToParse, '(.+?)($|' || separator || ')', 1, 3, 'c', 1) AS part_3
     , REGEXP_SUBSTR(lineToParse, '(.+?)($|' || separator || ')', 1, 4, 'c', 1) AS part_4
  INTO field1, field2, field3, field4
  FROM DUAL;

However, if lineToParse starts with the separator then you'll still have to deal with that in some way. Changing the first instance of REGEXP_SUBSTR() to this seems to work:

REGEXP_SUBSTR(lineToParse, '^(' || separator || ')?(.+?)($|' || separator || ')', 1, 1, 'c', 2) AS part_1

Hope this helps.

Thank you! I found the same solution myself. I also used the field1 := REPLACE(field1, separator, '') for each of the four fields to remove the trailing commas. — Link Marston, May 22 '18 at 17:27

Gary_W · Answer 2 · 2018-05-22T18:46:02.657

Just change your regex call to this:

REGEXP_SUBSTR(lineToParse, '(.*?)(' || separator || '|$)', 1, 1, NULL, 1) AS part_1,

This defines a group of characters followed by a group consisting of your separator or the end of the line. It matches the 1st occurrence of this group (the 4th argument) and returns the 1st group (the 6th argument).

Bonus! This regex form handles NULL list elements too, where the form '[^,]' does not!

Even better, build a function that you can put in a utility package for reuse that encapsulates this functionality and call that. That way, people not comfortable with regex can use it and if a change is ever needed you only have to change regex code in one place:

FUNCTION  GET_LIST_ELEMENT(string_in VARCHAR2, element_in NUMBER, delimiter_in VARCHAR2 DEFAULT ',') RETURN VARCHAR2 IS
    BEGIN
      if string_in is null then
        return NULL;
      else
        RETURN REGEXP_SUBSTR(string_in, '(.*?)(\' || delimiter_in || '|$)', 1, element_in, NULL, 1);
      end if;
  END GET_LIST_ELEMENT;

Then your code would look like this:

util.get_list_element(lineToParse, 1, separator) AS part_1, 
util.get_list_element(lineToParse, 2, separator) AS part_2, 
util.get_list_element(lineToParse, 3, separator) AS part_3, 
util.get_list_element(lineToParse, 4, separator) AS part_4

Exclude a series of characters in regex via Oracle's REGEXP_SUBSTR

2 Answers2