
Hi, I have a regex like this:

(.*(?=\sI+)*) (.*)

But it doesn't capture the groups the way I need.

For this example data:

  1. Vladimir Goth
  2. Langraab II Landgraab
  3. Léa Magdalena III Rouault Something
  4. Anna Maria Teodora
  5. Léa Maria Teodora II

Only 1 and 2 are captured correctly.

So what I need is:

  • If there is no I+, split at the first space.
  • If other words follow I+, the first group should contain everything up to and including I+. So group 1 for the 3rd example should be Léa Magdalena III.
  • If no other words follow I+, as in example 5, group 1 should capture only up to the first space.

Edit: I+ should be replaced by a pattern for Roman numbers.

ΩmegaMan
VANILKA

1 Answer


If you want to support any Roman number, you can use:

^(\S+(?:.*\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)

If you only need to support Roman numbers below XX (that is, I through XIX):

^(\S+(?:.*\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\b(?= +\S))?) +(.*)

See the regex demo #1 and demo #2. In Java code, replace the literal spaces with \h or \s, and double the backslashes in the string literal.

Details:

  • ^ - start of string
  • ( - Group 1 start:
    • \S+ - one or more non-whitespaces
    • (?: - a non-capturing group:
      • .* - zero or more characters other than line break characters, as many as possible
      • \b - a word boundary
      • (?=[MDCLXVI]) - require at least one Roman digit immediately to the right
      • M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}) - a Roman number pattern
      • \b - a word boundary
      • (?= +\S) - a positive lookahead that requires one or more spaces and then one non-whitespace right after the current position
    • )? - end of the non-capturing group, repeat one or zero times (it is optional)
  • ) - end of the first group
  • + - one or more spaces
  • (.*) - Group 2: the rest of the line.
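
To see why the (?=[MDCLXVI]) lookahead matters: every part of the Roman number pattern is optional, so on its own it also matches the empty string. The following is a minimal sketch of that point; the class and method names are only illustrative and are not part of the original answer.

```java
import java.util.regex.Pattern;

class RomanLookaheadDemo {
    // Roman number body as used in the answer; every part of it is optional.
    static final String ROMAN_BODY =
        "M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})";

    // Same body, guarded by the lookahead that demands one Roman digit first.
    static final String GUARDED = "(?=[MDCLXVI])" + ROMAN_BODY;

    // Pattern.matches requires the whole input to match.
    static boolean matchesBody(String s) {
        return Pattern.matches(ROMAN_BODY, s);
    }

    static boolean matchesGuarded(String s) {
        return Pattern.matches(GUARDED, s);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "III", "XIV", "MMXXI"}) {
            System.out.println("\"" + s + "\" body=" + matchesBody(s)
                + " guarded=" + matchesGuarded(s));
        }
    }
}
```

The empty string is accepted by the unguarded body but rejected once the lookahead is added, while real numbers such as III, XIV and MMXXI are accepted by both.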

In Java:

String regex = "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)";
// Or
String regex = "^(\\S+(?:.*\\b(?=[XVI])X?(?:IX|IV|V?I{0,3})\\b(?=\\s+\\S))?)\\s+(.*)";
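
A quick way to check the first pattern against the sample names from the question is a small harness like the one below. The class name NameSplitDemo and the split helper are made up for this illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NameSplitDemo {
    // Full-range pattern from the answer, using \h (horizontal whitespace) for spaces.
    static final Pattern NAME = Pattern.compile(
        "^(\\S+(?:.*\\b(?=[MDCLXVI])M{0,4}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})"
        + "(?:IX|IV|V?I{0,3})\\b(?=\\h+\\S))?)\\h+(.*)");

    // Returns {group 1, group 2}, or null when the line has nothing to split on.
    static String[] split(String line) {
        Matcher m = NAME.matcher(line);
        return m.find() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Vladimir Goth",
            "Langraab II Landgraab",
            "Léa Magdalena III Rouault Something",
            "Anna Maria Teodora",
            "Léa Maria Teodora II"
        };
        for (String s : samples) {
            String[] parts = split(s);
            System.out.println(parts[0] + " | " + parts[1]);
        }
    }
}
```

With the five sample lines, group 1 should come out as Vladimir, Langraab II, Léa Magdalena III, Anna and Léa respectively, with the remainder of each line in group 2.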
Wiktor Stribiżew
  • Yep, works perfectly. Thanks a lot! – VANILKA Dec 09 '21 at 22:29
  • @VANILKA Now, it should work. I thought I managed to make the Roman number pattern match at least one char before, but it appears it could not match `X` number. Now, with the lookahead, it should require at least one Roman digit and it should work fine now. – Wiktor Stribiżew Dec 09 '21 at 22:44