
I have a list of names:

Joe
Bob
Carl
Seth Smith II
Doug IV

I am trying to write a regular expression that will return the names, but not the Roman numerals. So my result set should look like:

Joe
Bob
Carl
Seth Smith
Doug

I've been looking at negative lookaheads, but I am pretty new to this, so I'm not sure if I'm on the right track. Thank you!

Wiktor Stribiżew
Sean Smyth
  • Depends on the rest of the string. With your current examples, you could just use: `^[A-Z][a-z]+( [A-Z][a-z]+)* ?` – jhnc Jul 19 '19 at 23:54
  • Can you just trim trailing Roman numerals in your code before processing the name? – Bohemian Jul 20 '19 at 03:14

1 Answer

^(?:.(?! (?=[MDCLXVI])(M*)(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$))+\S?


This regular expression should work, but it may be overkill for your use case because it checks for all possible Roman numerals in modern strict notation, including very large numbers in the thousands. It correctly handles names or surnames written in capital letters that happen to be valid Roman numerals, unless they appear at the very end (e.g. "Jet LI"), in which case they are treated as a Roman numeral and removed.

This was my logic:

  1. Let's match the start of the string, followed by one or more instances of <any non-linebreak character not followed by space + Roman numeral + end>, optionally followed by one more non-space character (the last letter of the surname, which may itself be followed by space + Roman numeral + end).
    ^(?:<any non-linebreak character not followed by space + Roman numeral + end>)+\S?
  2. <any non-linebreak character not followed by space + Roman numeral + end> is matched using this regex:
    .(?! <Roman numeral>$)
  3. And a <Roman numeral> in modern strict notation can be matched like this:
    (?=[MDCLXVI])(M*)(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})
  4. Now substitute everything together to get the final regex.
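
The substitution in steps 1 to 4 can be sketched in Python, which supports the same lookahead syntax. This is only an illustration under assumptions: the choice of Python and the names `ROMAN`, `NAME_RE` and `strip_numeral` are hypothetical and not part of the original answer.

```python
import re

# Step 3: a Roman numeral in modern strict notation.
# The leading lookahead rejects the empty string, which the rest would accept.
ROMAN = r"(?=[MDCLXVI])(M*)(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})"

# Steps 1, 2 and 4: any character not followed by " <Roman numeral><end>",
# repeated, plus at most one extra non-space character (the last letter of the name).
NAME_RE = re.compile(r"^(?:.(?! " + ROMAN + r"$))+\S?")


def strip_numeral(name):
    """Return the name without a trailing Roman numeral (hypothetical helper)."""
    match = NAME_RE.match(name)
    # Fall back to the input unchanged if the pattern does not match at all.
    return match.group(0) if match else name


names = ["Joe", "Bob", "Carl", "Seth Smith II", "Doug IV"]
print([strip_numeral(n) for n in names])
# ['Joe', 'Bob', 'Carl', 'Seth Smith', 'Doug']
```

Building the pattern from a named `ROMAN` fragment keeps the numeral definition in one place, so it can be swapped for a narrower range without touching the rest.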

Note:

If you only want to consider Roman numerals in a certain range, update the <Roman numeral> part accordingly. E.g. for numbers smaller than twenty it would become (?=[XVI])X?(I[XV]|V?I{0,3}). The entire regex would then be:

^(?:.(?! (?=[XVI])X?(I[XV]|V?I{0,3})$))+\S?
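
As a rough check of the restricted variant, the following Python sketch applies it to a few names. The names `SMALL_RE` and `strip_small_numeral` and the extra test inputs are assumptions made for illustration only. Note that "Jet LI" is now left untouched, because LI (51) is outside the range I to XIX.

```python
import re

# Restricted variant: only numerals smaller than twenty (I to XIX) are stripped.
SMALL_RE = re.compile(r"^(?:.(?! (?=[XVI])X?(I[XV]|V?I{0,3})$))+\S?")


def strip_small_numeral(name):
    """Strip a trailing Roman numeral below twenty (hypothetical helper)."""
    match = SMALL_RE.match(name)
    # Fall back to the input unchanged if the pattern does not match at all.
    return match.group(0) if match else name


for name in ["Doug IV", "Seth Smith XIX", "Jet LI"]:
    print(strip_small_numeral(name))
# Doug
# Seth Smith
# Jet LI
```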

Reference:

Roman Numerals




Update:

Here is another possible regex, which should be faster than the one above because it matches runs of non-space characters greedily and evaluates the negative lookahead only at spaces.

^(?:\S+| (?!(?=[IVXLCDM])(M*)(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})$))+


The general logic here is:

^(?:\S+| (?!<Roman numeral>$))+
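
A minimal Python sketch of this second pattern, assuming the same sample names as in the question. The names `ROMAN`, `FAST_RE` and `strip_numeral_fast` are hypothetical, and no timing claim is made here; the sketch only shows that the results agree with the expected output.

```python
import re

# Roman numeral in modern strict notation, as used in the pattern above.
ROMAN = r"(?=[IVXLCDM])(M*)(C[MD]|D?C{0,3})(X[CL]|L?X{0,3})(I[XV]|V?I{0,3})"

# Either a run of non-space characters, or a single space that is NOT
# followed by a Roman numeral at the end of the string.
FAST_RE = re.compile(r"^(?:\S+| (?!" + ROMAN + r"$))+")


def strip_numeral_fast(name):
    """Return the name without a trailing Roman numeral (hypothetical helper)."""
    match = FAST_RE.match(name)
    # Fall back to the input unchanged if the pattern does not match at all.
    return match.group(0) if match else name


names = ["Joe", "Bob", "Carl", "Seth Smith II", "Doug IV"]
print([strip_numeral_fast(n) for n in names])
# ['Joe', 'Bob', 'Carl', 'Seth Smith', 'Doug']
```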
Petr Srníček