Regex to remove duplicate numbers from a string

Question

I have produced a data set with codes separated by pipe symbols. I realized there are many duplicates in each row. Here are three example rows (the regex is applied to each row individually in KNIME)

0612|0613|061|0612|0612
0211|0612|021|0212|0211|0211
0111|0111
0511|0512|0511|0511|0521|0512|0511

I am trying to build a regex that removes the duplicate code numbers from each row. I tested \b(\d+)\b.*\b\1\b from a different thread here but the expression does not keep the other codes. The desired outputs for the example rows above would be

0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521

Appreciate your help

You can't do this with a regexp. It won't do overlapping replacements. — Barmar, Jun 18 '22 at 18:37
Why does the last sample output have duplicate `0512`? Do you only care about duplicates of the first number? — Barmar, Jun 18 '22 at 18:38
Why is `0111|0111` becoming `0111` but `|0612|0612` is getting removed completely? — anubhava, Jun 18 '22 at 19:05
I don't know about knime, but with most languages you could easily do that without a regex. In Ruby, for example, `a = "0211|0612|021|0212|0211|0211".split('|') #=> ["0211", "0612", "021", "0212", "0211", "0211"]`, then `b = a.uniq #=> ["0211", "0612", "021", "0212"]`, then `b.join('|') #=> "0211|0612|021|0212"`. In fact, these operations would be combined: `"0211|0612|021|0212|0211|0211".split('|').uniq.join('|')`. The reader should be able to see how these operations could be translated to other languages. — Cary Swoveland, Jun 18 '22 at 21:27
@Cary you are right, in KNIME your proposed approach without regex will also work, although it would require more “steps” (or so-called nodes) to get the result (as KNIME basically provides a node with inputs and outputs for each operation instead of coding to perform data analytics workflows) — Anonymous, Jun 18 '22 at 23:55
@Barmar Sorry, the duplicate 0512 in the last row was a mistake, I have corrected it in my initial post — citizen4, Jun 19 '22 at 13:05

bobble bubble · Accepted Answer · 2022-06-25T12:00:11.183

3

No idea what regex engine this knime uses.

Probably you need one that supports variable length lookbehind to do it in one pass, eg .NET

\|(\d+)\b(?<=\b\1\b.*?\1)

See .NET regex demo at Regexstorm (check [•] replace matches with, click on "context")

Update: Turns out knime uses Java's pattern implementation...

In Java regex variable-width lookbehind is actually implemented, but only by use of finite repitition. The second issue is, that backreference \1 can't be used inside a lookbehind. So we'd need some trickery and put it into a lookahead which we put in the lookbehind.

Let's assume a maximum potential distance of 999 characters between duplicates and each field can contain up to 9 digits (adjust these values to your needs).

\|(\d+)\b(?<=\b(?=\|?\1\b).{1,999}?\|\d{1,9})

Java regex demo at Regex101 (explanation on right side)

0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521

With only a lookahead you can get unique rows too, but vice versa (not like your desired results)

\b(\d+)\|(?=.*?\b\1\b)

Another demo on Regex101

0613|061|0612
0612|021|0212|0211
0111
0521|0512|0511

For further information have a look into the Stackoverflow Regex FAQ.

edited Jun 25 '22 at 12:00

answered Jun 18 '22 at 18:40

bobble bubble

16,888
3
27
46

1

Great to see you posting answers again :-) – The fourth bird Jun 18 '22 at 18:50
1

Thank you 4th bird, always happy reading from you:) – bobble bubble Jun 18 '22 at 18:51
1

Thanks for the solutions. KNIME uses Java's pattern implementation. Here is the documentation for the relevant nodes: https://nodepit.com/node/org.knime.base.node.preproc.stringreplacer.StringReplacerNodeFactory https://nodepit.com/node/ws.palladian.nodes.extraction.regex.RegexExtractorNodeFactory It looks like I cannot implement the substitution. The string replacer only allows me to replace all occurrences of a substr matching the pattern. The regex extractor is neat, makes capturing group columns but doesn't allow substitution either. At least I couldn't get it done using these nodes ... – citizen4 Jun 20 '22 at 20:43
@citizen4 Thanks for your comment! See my updated answer. Maybe [this demo](https://regex101.com/r/P8aHin/2) works for you. – bobble bubble Jun 20 '22 at 22:41
1

@bobblebubble Awesome! Using the String Replacer node with regex pattern type and "replace all occurrences" selected, I finally get the desired result in KNIME with the updated expression :) thank you – citizen4 Jun 21 '22 at 10:42
You're welcome! Funnily I just was doing some little edit at the same time (made the quantifier `{1,999}` [lazy](https://stackoverflow.com/a/2301291/5527985) for performance reasons by attaching a `?`. Glad that trickery works for you. Adjust the values 999/9 if not sufficient, but I think they are broad enough, aren't they. – bobble bubble Jun 21 '22 at 11:00
1

Yes, that is sufficient because even the longest row has less than 150 char. The codes only have 3 or 4 digits, I adapted the expression accordingly ;) – citizen4 Jun 21 '22 at 11:10

score 2 · Answer 2 · answered Jun 18 '22 at 19:15

2

Based on the expected output shown, you can use this regex:

(\|\d+)\1(?:((?:\|\d+)*)\1)?(?=\||$)|^(\d+)\|(?=\3\b)

Replacement string is: $2

RegEx Demo

answered Jun 18 '22 at 19:15

anubhava

761,203
64
569
643

Regex to remove duplicate numbers from a string

2 Answers2