POSIX ERE Regular expression to find repeated substring

Question

I have a set of strings containing a minimum of 1 and a maximum of 3 values in a format like this:

123;456;789
123;123;456
123;123;123
123;456;456
123;456;123

I'm trying to write a regular expression so I can find values repeated on the same string, so if you have 123;456;789 it would return null but if you had 123;456;456 it would return 456 and for 123;456;123 return 123

I managed to write this expression:

(.*?);?([0-9]+);?(.*?)\2

It works in the sense that it returns null when there are no duplicate values but it doesn't return exactly the value I need, eg: for the string 123;456;456 it returns 123;456;456and for the string 123;123;123 it returns 123;123

What I need is to return only the value for the ([0-9]+) portion of the expression, from what I've read this would normally be done using non-capturing groups. But either I'm doing it wrong or Oracle SQL doesn't support this as if I try using the ?: syntax the result is not what I expect.

Any suggestions on how you would go about this on oracle sql? The purpose of this expression is to use it on a query.

SELECT REGEXP_SUBSTR(column, "expression") FROM DUAL;

EDIT:

Actually according to https://docs.oracle.com/cd/B12037_01/appdev.101/b10795/adfns_re.htm

Oracle Database implements regular expression support compliant with the POSIX Extended Regular Expression (ERE) specification.

Which according to https://www.regular-expressions.info/refcapture.html

Non-capturing group is not supported by POSIX ERE

In my opinion, you don't need a regular expression. You need to fix your data model so you are not storing lists of numbers in a delimited string. — Gordon Linoff, Oct 02 '17 at 18:13
I know that storing the values like that is not optimal, but it's not my database so I can't just "fix the data model" — Artemio Ramirez, Oct 02 '17 at 18:17
You didn't tell us what you want from the last row in your example: is `123` a repeated "string" in `123;456;123`? Separately, what if a string looks like `123;123;456;456`? What do you need returned - the FIRST substring (really, token) found in the string? — , Oct 02 '17 at 18:26
Also: are the double-quotes part of your strings, or did you just mean them as string delimiters? In Oracle the string delimiter is the single quote rather than the double quote. — , Oct 02 '17 at 18:29
So I thought this would be an interesting take (using purely regex) to do what I believe you want to do. Note that this uses PCRE regex and likely will not work in POSIX (by likely, I mean it doesn't), but, nevertheless, it does answer your question `"(?:(\d+);(?(?=\1)(?\d+)|(\d+));(?(?=\1|\2|\3)(?\d+)|(\d+)))"`. See it in use here: https://regex101.com/r/31WULs/1 — ctwheels, Oct 02 '17 at 18:30
Edited the question, `123;456;123` still contains `123` twice so expected result would be `123` as for `123;123;456;456` there can only be a maximum of 3 values so 2 different values can't be repeated on the same string — Artemio Ramirez, Oct 02 '17 at 18:30
"...there can only be a maximum of 3 values..." >> You have to be careful with requirements like that. In my experience, a business rule is only true for as long as it takes someone else to decide they need a change. — Shawn, Oct 02 '17 at 18:38
@Shawn well, I'm talking about the set of strings i currently have, even if it changes in the future it won't affect what I'm trying to do right now. — Artemio Ramirez, Oct 02 '17 at 19:09
@TheChetan definitely a more elegant expression, but as you can see the result is the same https://regex101.com/r/31WULs/2 what I need is for the expression to only return the value captured by group 1 — Artemio Ramirez, Oct 02 '17 at 19:34
According to https://regular-expressions.mobi/oracle.html?wlr=1, you can select the value of a particular group using `regex_substr`. Use the two together. — TheChetan, Oct 03 '17 at 02:42

score 1 · Answer 1 · answered Oct 02 '17 at 18:37

1

If you only have three substrings, then you can use a brute force method. It is not particularly pretty, but it should do the job:

select (case when val1 in (val2, val3) then val1
             when val2 = val3 then val2
        end) as repeated
from (select t.*,
             regexp_substr(col, '[^;]+', 1, 1) as val1,
             regexp_substr(col, '[^;]+', 1, 2) as val2,
             regexp_substr(col, '[^;]+', 1, 3) as val3
      from t
     ) t
where val1 in (val2, val3) or val2 = val3;

answered Oct 02 '17 at 18:37

Gordon Linoff

1,242,037
58
646
786

Right now I'm mostly curious as to wether it's possible to solve this with only a regular expression, but yeah it had to be bruteforced due to time constrains – Artemio Ramirez Oct 02 '17 at 19:06
I'm afraid your regex will fail if there is a NULL element. Please use this form instead: `regexp_substr(col, '(.*?)(;|$)', 1, 1, NULL, 1)`. See here for proof: https://stackoverflow.com/questions/31464275/split-comma-separated-values-to-columns-in-oracle/31464699#31464699 – Gary_W Oct 02 '17 at 20:09
@Gary_W . . . It seems to work for me with empty values: http://rextester.com/EKIXZ93706. – Gordon Linoff Oct 03 '17 at 01:36
I stand corrected (for this example anyway)! I will study your example, but would still suggest changing the regex for reasons in the link provided. It still works :-) – Gary_W Oct 03 '17 at 14:04

TheChetan · Accepted Answer · 2017-10-03T14:17:04.303

1

This answer describes how to select a matching group from a regex. So using that,

SELECT regexp_substr(column, '(\d{3}).*\1', 1, 1, NULL, 1) from dual;
#                                                       ^ Select group 1

Working demo of the regex (courtesy: OP).

edited Oct 03 '17 at 14:17

answered Oct 03 '17 at 04:32

TheChetan

4,440
3
32
41

Be careful, this matches 123 in '123;456;123456' – Gary_W Oct 03 '17 at 16:08
I didn't know regexp_substr could take more parameters, that's exactly what I was trying to accomplish and I dun think I can find a better solution. Ty – Artemio Ramirez Oct 04 '17 at 05:17
@Gary_W in this scenario that's not an issue since all values separated by `;` have the same lenght, however if you could have values of different lengths, this expression would solve that issue: `((^|;)(\d+)).*\1(;|$)` – Artemio Ramirez Oct 04 '17 at 05:37
@ArtemioRamirez, actually your regex still matches `123;456;4123`, what you need is a `\b` before `\1`, Im not sure if its supported in this. – TheChetan Oct 04 '17 at 05:42
@Gary_W got me there again, I'm not sure how exactly \b works but it probably can be fixed in a different way. I'll try – Artemio Ramirez Oct 04 '17 at 05:58
@Gary_W got it `((^|;)(\d+));?(.*;)?\1(;|$)` – Artemio Ramirez Oct 04 '17 at 06:03
Now I realized that both `((^|;)(\d+)).*\b\1(;|$)` and `((^|;)(\d+));?(.*;)?\1(;|$)` match `1234;123;456` which is wrong – Artemio Ramirez Oct 04 '17 at 06:13
`(\b\d+\b).*\1` will work if `\b` works. `\b` is word break, it looks for word endings – TheChetan Oct 04 '17 at 06:16
@TheChetan that one doesn't work though (https://regex101.com/r/31WULs/3) but this one seems to work on any combination i've tried `((^|;)(\d+))(;.*)?;\3(;|$)` (https://regex101.com/r/31WULs/4) – Artemio Ramirez Oct 04 '17 at 06:19
Yeah that one works (https://regex101.com/r/31WULs/5) but, why does it work? what makes something qualify as a word? – Artemio Ramirez Oct 04 '17 at 06:26
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/155873/discussion-between-thechetan-and-artemio-ramirez). – TheChetan Oct 04 '17 at 06:45

Gary_W · Answer 3 · 2017-10-04T15:26:30.863

Please bear with me and think of this different approach. Look at the problem a little differently and break it down in a way that gives you more flexibility in how you you are able look at the data. It may or may not apply to your situation, but hopefully should be interesting to keep in mind that there are always different ways to approach a problem.

What if you turned the strings into rows so you could do standard SQL against them? That way you could not only count elements that repeat but perhaps apply aggregate functions to look for patterns across sets or something.

Consider this then. The first Common Table Expression (CTE) builds the original data set. The second one, tbl_split, turns that data into a row for each element in the list. Uncomment the select that immediately follows to see. The last query selects from the split data, showing the count of how often the element occurs in the id's data. Uncomment the HAVING line to restrict the output to those elements that appear more than one time for the data you are after.

With the data in rows you can see how other aggregate functions could be applied to slice and dice to reveal patterns, etc.

SQL> with tbl_orig(id, str) as (
     select 1, '123;456;789' from dual union all
     select 2, '123;123;456' from dual union all
     select 3, '123;123;123' from dual union all
     select 4, '123;456;456' from dual union all
     select 5, '123;456;123' from dual
   ),
   tbl_split(id, element) as (
   select id,
          regexp_substr(str, '(.*?)(;|$)', 1, level, NULL, 1) element
   from tbl_orig
   connect by level <= regexp_count(str, ';')+1
   and prior id = id
   and prior sys_guid() is not null
   )
   --select * from tbl_split;
   select distinct id, element, count(element)
   from tbl_split
   group by id, element
   --having count(element) > 1
   order by id;

        ID ELEMENT     COUNT(ELEMENT)
---------- ----------- --------------
         1 123                      1
         1 456                      1
         1 789                      1
         2 123                      2
         2 456                      1
         3 123                      3
         4 123                      1
         4 456                      2
         5 123                      2
         5 456                      1

10 rows selected.

SQL>

You are right in that it's always nice to evaluate different approaches to things that may let you see things you might be missing. For my particular scenario though the answer I already marked still fita best. As I didn't need nor want to manipulate the data any further than identifying the rows with duplicate values in them and which value was the duplicate — Artemio Ramirez, Oct 04 '17 at 18:14
See my edit at the bottom. I'm saying this approach is another that can answer your question while setting a framework for looking at the data in different ways too. Just a approaching the problem a different way. — Gary_W, Oct 04 '17 at 18:54

POSIX ERE Regular expression to find repeated substring

3 Answers3