
I have trouble matching multiple groups, some of which are optional. I've tried variations of greedy/non-greedy matching, but can't get it to work.

As input, I have cells which look like this:

SEPA Overboeking                 IBAN: AB1234        BIC: LALA678                    Naam: John Smith            Omschrijving: Hello hello        Kenmerk: 03-05-2019 23:12 533238

I want to split these up into the groups IBAN, BIC, Naam, Omschrijving and Kenmerk.

For this example, this yields: AB1234; LALA678; John Smith; Hello hello; 03-05-2019 23:12 533238. To obtain this, I've used:

.*IBAN: (.*)\s+BIC: (.*)\s+Naam: (.*)\s+Omschrijving: (.*)\s+Kenmerk: (.*)

This works perfectly as long as all these groups are present in the input. Some cells, however, don't have the "Omschrijving" and/or "Kenmerk" part. As output, I would like to have empty groups for the parts that are not present. Right now, nothing is matched at all. I've tried variations with greedy/non-greedy quantifiers, but couldn't get it to work.

Help would be greatly appreciated!

N.B.: I'm working in KNIME (open source data analysis tool)

Cobra

1 Answer


I was able to split your input using the following regular expression:

^.*
\s+IBAN\:\s*(?<IBAN>.*?)
\s+BIC\:\s*(?<BIC>.*?)
\s+Naam\:\s*(?<Naam>.*?)
(?:\s+Omschrijving\:\s*(?<Omschrijving>.*?))?
(?:\s+Kenmerk\:\s*(?<Kenmerk>.*?))?
$

This requires the fields to appear in the given order. IBAN, BIC and Naam are treated as required, while Omschrijving and Kenmerk are optional. I am pretty sure this can still be optimized, but it produces the following output, which should be fine for you (or at least a starting point):

[Screenshot: example output results]
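For readers who want to try the pattern outside KNIME, here is a minimal sketch in Python. It makes a few assumptions that are not part of the original answer: Python spells named groups `(?P<name>...)` instead of `(?<name>...)`, `re.VERBOSE` is used so the pattern can stay on several lines, a trailing `\s*` is added before `$` so the last field does not pick up trailing spaces, and `parse_cell` is a hypothetical helper name.

```python
import re

# Same structure as the pattern above, adapted to Python's named-group syntax.
# re.VERBOSE ignores the literal line breaks and indentation inside the pattern.
PATTERN = re.compile(
    r"""
    ^.*
    \s+IBAN:\s*(?P<IBAN>.*?)
    \s+BIC:\s*(?P<BIC>.*?)
    \s+Naam:\s*(?P<Naam>.*?)
    (?:\s+Omschrijving:\s*(?P<Omschrijving>.*?))?   # optional field
    (?:\s+Kenmerk:\s*(?P<Kenmerk>.*?))?             # optional field
    \s*$                                            # assumption: drop trailing spaces
    """,
    re.VERBOSE,
)


def parse_cell(cell):
    """Return the five fields as a dict, with "" for absent optional fields.

    Returns None when a required field (IBAN, BIC, Naam) is missing.
    """
    match = PATTERN.match(cell)
    if match is None:
        return None
    # Groups that did not take part in the match are reported as "".
    return match.groupdict(default="")


if __name__ == "__main__":
    cells = [
        "SEPA Overboeking   IBAN: AB1234   BIC: LALA678   Naam: John Smith"
        "   Omschrijving: Hello hello   Kenmerk: 03-05-2019 23:12 533238",
        "SEPA Overboeking   IBAN: AB1234   BIC: LALA678   Naam: John Smith",
    ]
    for cell in cells:
        print(parse_cell(cell))
```

The optional parts work because each one is wrapped in a non-capturing group followed by `?`. The lazy `.*?` in front of it keeps growing until either the next label or the end of the line fits.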

For evaluation and testing in KNIME, I used Palladian's Regex Extractor node, which can be configured as follows and provides a nice preview functionality:

[Screenshot: Regex Extractor configuration]

I added an example workflow to my NodePit Space. It contains some example lines, parses them and produces the output shown above.

Daniel
  • Wow, amazing! Thanks so much for the extensive reply. Also great tip to use the regex extractor tool, I'll check it out. What are the functions called where you start a group with a question mark? And the use of <>? Thanks! – Cobra Jun 02 '20 at 18:26
  • You are welcome! The `(?:group)` constructs are "non-capturing groups", which I just use to make those parts optional. The `(?<name>group)` construct is called a "named capturing group". It allows back references by name, but is also used by the Regex Extractor node to name the extracted columns. Pretty handy and avoids using a Column Rename node afterwards. – Daniel Jun 03 '20 at 10:34
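
The two constructs from the comment above can be illustrated with a small Python sketch. This is only an illustration under assumptions: Python writes the named group as `(?P<name>...)` and the named back reference as `(?P=name)`, and the shortened pattern and sample strings here are made up for the example.

```python
import re

# (?:...) groups tokens without capturing; the trailing ? makes that part optional.
# (?P<name>...) is Python's spelling of the named capturing group (?<name>...).
pattern = re.compile(r"Naam:\s*(?P<Naam>\w+)(?:\s+Kenmerk:\s*(?P<Kenmerk>\w+))?")

match = pattern.match("Naam: John")
captured = match.groupdict()   # only the named groups; the (?:...) wrapper adds none
group_count = pattern.groups   # number of capturing groups in the pattern

# A back reference by name: (?P=ch) must repeat whatever (?P<ch>...) captured.
doubled = re.search(r"(?P<ch>\w)(?P=ch)", "hello").group(0)

print(captured, group_count, doubled)
```

Here the non-capturing wrapper does not add to the group count, and the optional `Kenmerk` group is reported as `None` when it is absent.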