2

I have produced a data set with codes separated by pipe symbols. I realized there are many duplicates in each row. Here are three example rows (the regex is applied to each row individually in KNIME)

0612|0613|061|0612|0612
0211|0612|021|0212|0211|0211
0111|0111
0511|0512|0511|0511|0521|0512|0511

I am trying to build a regex that removes the duplicate code numbers from each row. I tested \b(\d+)\b.*\b\1\b from a different thread here but the expression does not keep the other codes. The desired outputs for the example rows above would be

0612|0613|061
0211|0612|021|0212
0111
0511|0512|0521

Appreciate your help

citizen4
  • 25
  • 4
  • *4 example rows actually – citizen4 Jun 18 '22 at 18:31
  • You can't do this with a regexp. It won't do overlapping replacements. – Barmar Jun 18 '22 at 18:37
  • 1
    Why does the last sample output have duplicate `0512`? Do you only care about duplicates of the first number? – Barmar Jun 18 '22 at 18:38
  • Why is `0111|0111` becoming `0111` but `|0612|0612` is getting removed completely? – anubhava Jun 18 '22 at 19:05
  • I don't know about knime, but with most languages you could easily do that without a regex. In Ruby, for example, `a = "0211|0612|021|0212|0211|0211".split('|') #=> ["0211", "0612", "021", "0212", "0211", "0211"]`, then `b = a.uniq #=> ["0211", "0612", "021", "0212"]`, then `b.join('|') #=> "0211|0612|021|0212"`. In fact, these operations would be combined: `"0211|0612|021|0212|0211|0211".split('|').uniq.join('|')`. The reader should be able to see how these operations could be translated to other languages. – Cary Swoveland Jun 18 '22 at 21:27
  • @Cary you are right, in KNIME your proposed approach without regex will also work, although it would require more “steps” (or so-called nodes) to get the result (as KNIME basically provides a node with inputs and outputs for each operation instead of coding to perform data analytics workflows) – Anonymous Jun 18 '22 at 23:55
  • 1
    @Barmar Sorry, the duplicate 0512 in the last row was a mistake, I have corrected it in my initial post – citizen4 Jun 19 '22 at 13:05

2 Answers2

3

No idea what regex engine this knime uses.

Probably you need one that supports variable length lookbehind to do it in one pass, eg .NET

\|(\d+)\b(?<=\b\1\b.*?\1)

See .NET regex demo at Regexstorm (check [•] replace matches with, click on "context")

Update: Turns out knime uses Java's pattern implementation...

In Java regex variable-width lookbehind is actually implemented, but only by use of finite repitition. The second issue is, that backreference \1 can't be used inside a lookbehind. So we'd need some trickery and put it into a lookahead which we put in the lookbehind.

Let's assume a maximum potential distance of 999 characters between duplicates and each field can contain up to 9 digits (adjust these values to your needs).

\|(\d+)\b(?<=\b(?=\|?\1\b).{1,999}?\|\d{1,9})

Java regex demo at Regex101 (explanation on right side)

0612|0613|061
​0211|0612|021|0212
​0111
​0511|0512|0521


With only a lookahead you can get unique rows too, but vice versa (not like your desired results)

\b(\d+)\|(?=.*?\b\1\b)

Another demo on Regex101

0613|061|0612
0612|021|0212|0211
0111
0521|0512|0511


For further information have a look into the Stackoverflow Regex FAQ.

bobble bubble
  • 16,888
  • 3
  • 27
  • 46
  • 1
    Great to see you posting answers again :-) – The fourth bird Jun 18 '22 at 18:50
  • 1
    Thank you 4th bird, always happy reading from you:) – bobble bubble Jun 18 '22 at 18:51
  • 1
    Thanks for the solutions. KNIME uses Java's pattern implementation. Here is the documentation for the relevant nodes: https://nodepit.com/node/org.knime.base.node.preproc.stringreplacer.StringReplacerNodeFactory https://nodepit.com/node/ws.palladian.nodes.extraction.regex.RegexExtractorNodeFactory It looks like I cannot implement the substitution. The string replacer only allows me to replace all occurrences of a substr matching the pattern. The regex extractor is neat, makes capturing group columns but doesn't allow substitution either. At least I couldn't get it done using these nodes ... – citizen4 Jun 20 '22 at 20:43
  • @citizen4 Thanks for your comment! See my updated answer. Maybe [this demo](https://regex101.com/r/P8aHin/2) works for you. – bobble bubble Jun 20 '22 at 22:41
  • 1
    @bobblebubble Awesome! Using the String Replacer node with regex pattern type and "replace all occurrences" selected, I finally get the desired result in KNIME with the updated expression :) thank you – citizen4 Jun 21 '22 at 10:42
  • You're welcome! Funnily I just was doing some little edit at the same time (made the quantifier `{1,999}` [lazy](https://stackoverflow.com/a/2301291/5527985) for performance reasons by attaching a `?`. Glad that trickery works for you. Adjust the values 999/9 if not sufficient, but I think they are broad enough, aren't they. – bobble bubble Jun 21 '22 at 11:00
  • 1
    Yes, that is sufficient because even the longest row has less than 150 char. The codes only have 3 or 4 digits, I adapted the expression accordingly ;) – citizen4 Jun 21 '22 at 11:10
2

Based on the expected output shown, you can use this regex:

(\|\d+)\1(?:((?:\|\d+)*)\1)?(?=\||$)|^(\d+)\|(?=\3\b)

Replacement string is: $2

RegEx Demo

anubhava
  • 761,203
  • 64
  • 569
  • 643