What does this regular expression mean in Oracle?

Question

SELECT  REGEXP_REPLACE(LISTAGG(A.ID, ',') WITHIN GROUP (ORDER BY A.ID), '([^,]+)(,\1)+', '\1')
FROM    TABLE A

I don't know what "\1" means in the above SQL. After creating a list by separating "A.ID" with commas through "LISTAGG", the purpose seems to be to remove the duplicate "A.ID", but I want to know the exact meaning.

For reference, "A.ID" is a NUMBER(4) column type. (e.g. 1111, 2222...)

The best thing for you to do here would be to test this query locally on your end. Seeing is believing. The call to `REGEXP_REPLACE` is somehow manipulating the CSV content generated by the call to `LISTAGG`. — Tim Biegeleisen, Jun 23 '22 at 10:08
Does this answer your question? [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) — Ivar, Jun 23 '22 at 10:10
@Ivar I'm new to "\1" in regex to mean a group reference. Your article has been really helpful. It was difficult to find a reference because I did not know the concept, but thank you for leaving it. — eastglow, Jun 23 '22 at 10:17
@Ivar It may be useful in general terms to explain what the individual elements of the regular expression do but it does not explain that the regular expression is inappropriate for the task the OP is using it for and will generate erroneous results. If Oracle supported positive look-ahead/behind then it might be possible to solve the problem using regular expressions but Oracle does not support them and it would still be an inefficient method of solving the OP's problem. The actual solution is to not use regular expressions. — MT0, Jun 23 '22 at 11:05

MT0 · Accepted Answer · 2022-06-23T11:35:38.777

It means that you are attempting to replace duplicates.

REGEXP_REPLACE(
  comma_separated_list,
  '([^,]+)(,\1)+',
  '\1'
)

Then:

([^,]+) will match one-or-more non-comma characters and store the value in a capturing group.
,\1 will match a comma and then the value from that first capturing group.
(,\1)+ matches a comma and then the value from that first capturing group and matches it all one-or-more times.

However, it does not work reliably. If you have the list 12,123,123,123,123,1234,4444

Then:

the first match will be:
```
12,123,123,123,123,1234,4444
^^^^^
```
at the start and replace it with just 12 giving:
```
123,123,123,123,1234,4444
^^
```
It has already gone wrong as the match did not match a complete element and you have lost the first id value.
the second match will start after the first match will skip the 3, characters and match:
```
123,123,123,123,1234,4444
    ^^^^^^^^^^^^^^^
```
and replace it with 123 giving:
```
123,1234,4444
    ^^^
```
Again, the match was wrong as it did not match a complete element and it is only coincidence that the value output is correct.
and the final match will be:
```
123,1234,4444
       ^^^
```
replacing it with just 4 and giving the output:
```
123,1234444
```
Which is very wrong as you are now missing the id 12 and have an incorrect id of 1234444.

What you should probably be doing is filtering the duplicates before aggregating.

In newer Oracle versions it is simply:

SELECT  LISTAGG(DISTINCT ID, ',') WITHIN GROUP (ORDER BY ID)
FROM    TABLE

or in older versions:

SELECT  LISTAGG(ID, ',') WITHIN GROUP (ORDER BY ID)
FROM    (
  SELECT DISTINCT id FROM TABLE
)

If you did want to use regular expressions (which will be a more inefficient than using DISTINCT) then you can double up the delimiters and ensure you always match complete elements using (,[^,]+,)\1+ but then you also need to remove the repeated delimiters after de-duplication (which makes an inefficient solution even more inefficient):

SELECT  TRIM(
          BOTH ',' FROM
          REPLACE(
            REGEXP_REPLACE(
              LISTAGG(',' || ID || ',') WITHIN GROUP (ORDER BY ID),
              '(,[^,]+,)\1+',
              '\1'
            ),
            ',,',
            ','
          )
        ) AS no_duplicates
FROM    TABLE_NAME

db<>fiddle here

What does this regular expression mean in Oracle?

1 Answers1