Regex with multiple groups, some of which are optional

Question

I have trouble matching multiple groups, some of which are optional. I've tried variations of greedy/non greedy, but can't get it to work.

As input, I have cells which look like this:

SEPA Overboeking                 IBAN: AB1234        BIC: LALA678                    Naam: John Smith            Omschrijving: Hello hello        Kenmerk: 03-05-2019 23:12 533238

I wanna split these up into groups of IBAN, BIC, Naam, Omschrijving, Kenmerk.

For this example, this yields: AB1234; LALA678; John Smith; Hello hello; 03-05-2019 23:12 533238. To obtain this, I've used:

.*IBAN: (.*)\s+BIC: (.*)\s+Naam: (.*)\s+Omschrijving: (.*)\s+Kenmerk: (.*)

This works perfectly as long as all these groups are present in the input. Some cells, however, don't have the "Omschrijving" and/or "Kenmerk" part. As output, I would like to have empty groups if they're not present. Right now, nothing is matched for such cells. I've tried variations with greedy/non-greedy quantifiers, but couldn't get it to work.

Help would be greatly appreciated!

N.B.: I'm working in KNIME (open source data analysis tool)

You can split on 2 or more whitespaces or make some capturing groups optional. — The fourth bird, Jun 01 '20 at 11:54
Tried that too, unfortunately sometimes there's only 1 whitespace in between the elements — Cobra, Jun 01 '20 at 11:57

score 4 · Accepted Answer · answered Jun 01 '20 at 13:52

I was able to split your input using the following regular expression:

^.*
\s+IBAN\:\s*(?<IBAN>.*?)
\s+BIC\:\s*(?<BIC>.*?)
\s+Naam\:\s*(?<Naam>.*?)
(?:\s+Omschrijving\:\s*(?<Omschrijving>.*?))?
(?:\s+Kenmerk\:\s*(?<Kenmerk>.*?))?
$

This requires your fields to appear in the given order and treats the fields IBAN, BIC and Naam as required. The fields Omschrijving and Kenmerk are optional. I am pretty sure this can still be optimized, but it results in the following output, which should be fine for you (or at least serve as a starting point):

For evaluation and testing in KNIME, I used Palladian's Regex Extractor node, that can be configured as follows and provides a nice preview functionality:

I added an example workflow to my NodePit Space. It contains some example lines, parses them and provides the above seen output.
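
For readers who want to try the pattern outside KNIME, here is a minimal Python sketch of the same idea. It is not part of the original workflow: the function name `parse_cell` is hypothetical, Python spells named groups as `(?P<name>...)` rather than `(?<name>...)`, and the colons are left unescaped (which should be equivalent). Optional groups that do not match are turned into empty strings, as asked in the question.

```python
import re

# Same structure as the pattern above, rewritten for Python's re module.
# Assumption: one cell per string, fields always in this order.
PATTERN = re.compile(
    r"^.*"
    r"\s+IBAN:\s*(?P<IBAN>.*?)"
    r"\s+BIC:\s*(?P<BIC>.*?)"
    r"\s+Naam:\s*(?P<Naam>.*?)"
    r"(?:\s+Omschrijving:\s*(?P<Omschrijving>.*?))?"  # optional field
    r"(?:\s+Kenmerk:\s*(?P<Kenmerk>.*?))?"            # optional field
    r"$"
)


def parse_cell(cell):
    """Return a dict of the five fields, or None if the required fields are missing."""
    match = PATTERN.match(cell)
    if match is None:
        return None
    # Unmatched optional groups come back as None; map them to empty strings.
    return {name: (value or "") for name, value in match.groupdict().items()}


if __name__ == "__main__":
    # A cell without Omschrijving and Kenmerk, separated by single spaces only.
    print(parse_cell("SEPA Overboeking IBAN: AB1234 BIC: LALA678 Naam: John Smith"))
```

The lazy `.*?` in each field only works because the pattern is anchored with `$`: each field grows until the next label or the end of the cell is reached.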

Wow, amazing! Thanks so much for the extensive reply. Also great tip to use the regex extractor tool, I'll check it out. What are the functions called where you start a group with a question mark? And the use of <>? Thanks! — Cobra, Jun 02 '20 at 18:26
You are welcome! The `(?:group)` construct is a "non-capturing group", which I only use to make those parts optional. The `(?<name>group)` construct is called a "named capturing group". It allows back references by name, but is also used by the Regex Extractor node to name the extracted columns. Pretty handy and avoids using a Column Rename node afterwards. — Daniel, Jun 03 '20 at 10:34
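
The difference between the two constructs can be seen in a tiny, hypothetical example, again in Python (where the named group is spelled `(?P<name>...)`). The `key=value` pattern is made up purely for illustration and has nothing to do with the bank data above.

```python
import re

# (?:...) only groups "=value" so that ? can make it optional; it creates no group.
# (?P<key>...) and (?P<value>...) are named capturing groups.
pattern = re.compile(r"(?P<key>\w+)(?:=(?P<value>\w+))?$")

with_value = pattern.match("color=red")
without_value = pattern.match("color")

print(pattern.groups)                # number of capturing groups in the pattern
print(with_value.group("value"))     # value captured by the named group
print(without_value.group("value"))  # None when the optional part is absent
```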
