0

I have a SQL Query containing a

REGEXP_REPLACE(LISTAGG(foo.name, ',') WITHIN GROUP (ORDER BY foo.name), '([^,]+)(,\1)+', '\1') AS bar

in its SELECT. LISTAGG concatenates the values of the foo.name column with a ',' as separator, whereas REGEXP_REPLACE replaces duplicates.

When I change the LISTAGG so that a ", " (comma followed by a whitespace character) is used as separator, how do I have to adjust the REGEXP_REPLACE?

Edit:

When changing the REGEXP_REPLACE to '([^,]+)(, \1)+' it seems to work and I get

CITRONENSÄURE, KALIUMSORBAT, PEKTIN

But when changing the REGEXP_REPLACE to '^([^,]+)(, \1)+$' I still have duplicates:

CITRONENSÄURE, CITRONENSÄURE, CITRONENSÄURE, CITRONENSÄURE, CITRONENSÄURE, KALIUMSORBAT, KALIUMSORBAT, KALIUMSORBAT, KALIUMSORBAT, KALIUMSORBAT, PEKTIN, PEKTIN, PEKTIN, PEKTIN, PEKTIN

So, is '([^,]+)(, \1)+' the right one, regardless of having no ^ and $?

Markus
  • 1,222
  • 1
  • 13
  • 26

1 Answers1

1

Exactly what you did in LISTAGG, add a space after the comma

'([^,]+)(, \1)+'

Though you should make sure it matches the whole string, anchoring at the beggining of the string (^) and at the end ($).

'^([^,]+)(, \1)+$'


Edit:
As for the regex syntax, to answer the edit in your question, if you want to remove duplicates within the same line, you have to make sure it's matching the entire value. For example, '([^,]+)(, \1)+' will match in 'fo[o], [o]ther'. So we'll match the trailing comma as well (or end of string).

([^,]+), (\1(, |$))+

This expression is much safer than the one you were trying to use. However, there are a few cases where it could fail.

SQL:

select regexp_replace(
    listagg("name", ', ') within group (order by "name")
    ,'([^,]+), (\1(, |$))+', '\1\3')
as "bar"
from foo;

DEMO in SQLfiddle



When could it fail?
The expression is not 100% safe because the first word is not anchored, and thus it can match at the middle of a value. For example, it will remove WORD from the list:

'AWORD, WORD, XXX'  ==>  'AWORD, XXX'
  ====--^^^^--

I don't think there's a way to avoid this, since Oracle implements POSIX EREs and does not support lookarrounds, word boundaries nor \G assertions. Moreover, this is not the proper way to remove duplicate values. With really long tables, you'll end up in ORA-01489: result of string concatenation is too long.

I'd recommend using GROUP BY or DISTINCT for that case. You can read about it in this article: SQL/mysql - Select distinct/UNIQUE but return all columns?.

Community
  • 1
  • 1
Mariano
  • 6,423
  • 4
  • 31
  • 47
  • Thanks a lot! With `'([^,]+)(, \1)+'` it works, with `'^([^,]+)(, \1)+$'` I still have duplicates. How do I have to change the second one to make it work? – Markus Sep 23 '15 at 14:36
  • Can you provide examples of your target data? It is not clear if you're dealing with 2 or more items separated by commas. Please edit your question showing edge cases. – Mariano Sep 23 '15 at 15:08
  • I've edited the question. I just want to ensure that `'([^,]+)(, \1)+'` is correct... – Markus Sep 25 '15 at 11:29
  • I believe now it should answer what you've specified in your last edit. As you can see, a properly tagged question, specifying the dbms, and providing a clear example with edge cases leads to more informed answers. Please keep this in mind if you [ask another question](http://stackoverflow.com/help/how-to-ask). – Mariano Sep 26 '15 at 09:27
  • Thanks a million for this great answer and your efforts! – Markus Sep 28 '15 at 05:52