How to create regular expression checking Roman numerals?

Question

I need to create a regular expression that verifies that the user's input is either:

4 digits, OR
a value like XXXXXX-YY, where XXXXXX is a Roman numeral from I to XXXIII and YY is two Latin letters (A-Z)

See this other question: http://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression — satoshi, Feb 18 '12 at 11:43
@RobW, it can be from 1 to 6 characters since expected value is from I to XXXIII (i.e. from 1 to 33). — LA_, Feb 18 '12 at 12:09

Rob W · Accepted Answer · 2012-02-18T14:05:53.087

According to the requirements, these are the possible Roman numeral formats. For readability, only the maximum number of X's is shown.

XXX III     (or: <empty>, I or II instead of III)
 XX V       (or: IV, IX and X instead of V)

I suggest this compact pattern:

/^(\d{4}|(?=[IVX])(X{0,3}I{0,3}|X{0,2}VI{0,3}|X{0,2}I?[VX])-[A-Z]{2})$/i

Explanation:

^                Start of string
(                Start of group 1.
  \d{4}             4 digits

|                 OR

  (?=[IVX])         Look-ahead: the next character must be an I, V or X
  (                  Start of group 2.
     X{0,3}I{0,3}       = 0 1 2 3  + { 0 ; 10 ; 20 ; 30} (roman)
  |                  OR
     X{0,2}VI{0,3}      = 5 6 7 8  + { 0 ; 10 ; 20 }     (roman)
  |                  OR
     X{0,2}I?[VX]       = 4 5 9 10 + { 0 ; 10 ; 20 }     (roman)
  )                  End of group 2
  -[A-Z]{2}          Postfixed by a hyphen and two letters
)                 End of group 1.
$                End of string
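
As a quick check of the breakdown above, here is a minimal sketch that runs the pattern against every numeral from 1 to 39. The `toRoman` helper is hypothetical (it is not part of the answer) and only covers values below 40; the `-AB` suffix is an arbitrary sample.

```javascript
// Pattern copied verbatim from the answer above.
const pattern = /^(\d{4}|(?=[IVX])(X{0,3}I{0,3}|X{0,2}VI{0,3}|X{0,2}I?[VX])-[A-Z]{2})$/i;

// Hypothetical helper (an assumption for this sketch): Roman numeral for 1..39.
function toRoman(n) {
  const ones = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"];
  return "X".repeat(Math.floor(n / 10)) + ones[n % 10];
}

// Collect every value in 1..39 whose "<numeral>-AB" form the pattern accepts.
const accepted = [];
for (let n = 1; n <= 39; n++) {
  if (pattern.test(toRoman(n) + "-AB")) accepted.push(n);
}

console.log(accepted.length, accepted[0], accepted[accepted.length - 1]); // 33 1 33
console.log(pattern.test("2012"));   // true: four digits
console.log(pattern.test("-AB"));    // false: the look-ahead rejects an empty numeral
console.log(pattern.test("xiv-ab")); // true: the /i flag makes the match case-insensitive
```

The look-ahead matters here: without it, `X{0,3}I{0,3}` could match nothing at all and `-AB` would be accepted.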

Wow, that’s pretty impressive! Requires a great deal more thinking than the [Regexp::Assemble](http://search.cpan.org/perldoc?Regexp::Assemble) technique though. — tchrist, Feb 18 '12 at 14:14

score 2 · Answer 2 · answered Feb 18 '12 at 12:27

Well, the part that matches a Roman numeral between I and XXXIII is:

(?:X(?:X(?:V(?:I(?:I?I)?)?|X(?:I(?:I?I)?)?|I(?:[VX]|I?I)?)?|V(?:I(?:I?I)?)?|I(?:[VX]|I?I)?)?|V(?:I(?:I?I)?)?|I(?:[VX]|I?I)?)

As revealed by this:

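#!/usr/bin/env perl
use strict;
use warnings;
use Regexp::Assemble;
use Roman;

my $ra = Regexp::Assemble->new;

# Add the uppercase Roman numeral for every value from 1 to 33.
for my $num (1..33) {
    $ra->add(Roman($num));
}

# Print the assembled pattern.
print $ra->re, "\n";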
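
For comparison, the same "enumerate every valid numeral" idea can be sketched in JavaScript without an assembler, as a plain alternation. This is not what Regexp::Assemble produces (it merges common prefixes); the `toRoman` helper and the variable names are assumptions made for this illustration, and the pattern is case-sensitive because this answer does not mention a flag.

```javascript
// Hypothetical helper (an assumption for this sketch): Roman numeral for 1..39.
function toRoman(n) {
  const ones = ["", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"];
  return "X".repeat(Math.floor(n / 10)) + ones[n % 10];
}

// List the numerals for 1..33 explicitly.
const numerals = [];
for (let n = 1; n <= 33; n++) numerals.push(toRoman(n));

// Longest alternatives first, so "III" is tried before "I".
numerals.sort((a, b) => b.length - a.length);

// Unoptimised alternation: (?:XXXIII|XXVIII|...|I)
const romanPart = "(?:" + numerals.join("|") + ")";
const fullPattern = new RegExp("^(?:\\d{4}|" + romanPart + "-[A-Z]{2})$");

console.log(fullPattern.test("XXXIII-ZZ")); // true
console.log(fullPattern.test("XXXIV-ZZ"));  // false
console.log(fullPattern.test("1234"));      // true
```

The alternation is longer than the assembled version but easy to audit, since every accepted numeral appears literally in the pattern.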

pete · Answer 3 · 2012-02-18T14:23:19.950

function inputIsValid(value) {
    // The "g" flag is unnecessary for an anchored, single-match test.
    var r = /(^[0-9]{4}$)|(^(?:(?:[X]{0,2}(?:[I](?:[XV]?|[I]{0,2})?|(?:[V][I]{0,3})?))|(?:[X]{3}[I]{0,3}))\-[A-Z]{2}$)/i;
    return r.test(value);
}

That will match either a 4-digit input or a Roman numeral (in the range 1 - 33) followed by a dash and two letters.

To explain the regex, below is an expanded source with comments:

// Test for a 4-digit number
(                                       // Start required capturing group
    ^                                   // Start of string
    [0-9]{4}                            // Test for 0-9, exactly 4 times
    $                                   // End of string
)                                       // End required capturing group
|                                       // OR
// Test for Roman Numerals, 1 - 33, followed by a dash and two letters
(                                       // Start required capturing group
    ^                                   // Start of string
    (?:                                 // Start required non-capturing group
        // Test for 1 - 29
        (?:                             // Start required non-capturing group
            // Test for 10, 20, (and implied 0, although the Romans did not have a digit, or mathematical concept, for 0)
            [X]{0,2}                    // X, optionally up to 2 times
            (?:                         // Start required non-capturing group
                // Test for 1 - 4, and 9
                [I]                     // I, exactly once (I = 1)
                (?:                     // Start optional non-capturing group
                    // IV = 4, IX = 9
                    [XV]?               // Optional X or V, exactly once
                    |                   // OR
                    // II = 2, III = 3
                    [I]{0,2}            // Optional I, up to 2 times
                )?                      // End optional non-capturing group
                |                       // OR
                // Test for 5 - 8
                (?:                     // Start optional non-capturing group
                    [V][I]{0,3}         // Required V, followed by optional I, up to 3 times
                )?                      // End optional non-capturing group
            )                           // End required non-capturing group
        )                               // End required non-capturing group
        |                               // OR
        // Test for 30 - 33
        (?:                             // Start required non-capturing group
            // Test for 30
            [X]{3}                      // X exactly 3 times
            // Test for 1 - 3
            [I]{0,3}                    // Optional I, up to 3 times
        )                               // End required non-capturing group
    )                                   // End required non-capturing group
    // Test for dash and two letters
    \-                                  // Literal -, exactly 1 time
    [A-Z]{2}                            // Alphabetic character, exactly 2 times
    $                                   // End of string
)                                       // End required capturing group

The 4-digit number and trailing \-[A-Z]{2} were (to me) self-evident. My method for the Roman Numerals was to:

1. Open Excel and populate a column with 1-33.
2. Convert that column to Roman numerals (in all 7 different varieties).
3. Check whether any of the varieties differed from one another for 1-33 (they didn't).
4. Fiddle with moving the Roman numerals into the minimum number of unique patterns that limited them to 33 (i.e., "then shalt thou count to thirty-three, no more, no less. Thirty-three shall be the number thou shalt count, and the number of the counting shall be thirty-three. Thirty-four shalt thou not count, neither count thou thirty-two, excepting that thou then proceed to thirty-three. Thirty-five is right out.")
5. Realize that up to thirty-nine is a single pattern (^(([X]{0,3}([I]([XV]?|[I]{0,2})?|([V][I]{0,3})?)))$, changed to capturing groups for better clarity).
6. Change the pattern to allow up to twenty-nine.
7. Add another to allow thirty to thirty-three.
8. Construct the whole pattern and test it in RegexBuddy (an invaluable tool for this stuff) against digits 0 - 20,000 and Roman numerals 1 - 150 followed by "-AA".
9. The pattern worked, so I posted it (then grabbed another cup o' coffee and self-administered an 'atta-boy' for completing what I thought was a lovely Saturday morning challenge).
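
The final testing step can be approximated in plain JavaScript instead of RegexBuddy. The sketch below is an assumption about how such a run might look, not the author's actual test: `toRoman` is a hypothetical helper valid for 1-399, and the pattern is the one from this answer without the `g` flag. One input outside those ranges, an empty numeral (`-AA`), is checked separately because the optional groups allow the numeral part to match nothing.

```javascript
// Pattern from this answer (without the "g" flag, so test() keeps no state).
const pattern = /(^[0-9]{4}$)|(^(?:(?:[X]{0,2}(?:[I](?:[XV]?|[I]{0,2})?|(?:[V][I]{0,3})?))|(?:[X]{3}[I]{0,3}))\-[A-Z]{2}$)/i;

// Hypothetical helper (an assumption for this sketch): Roman numeral for 1..399.
function toRoman(n) {
  const table = [[100, "C"], [90, "XC"], [50, "L"], [40, "XL"], [10, "X"],
                 [9, "IX"], [5, "V"], [4, "IV"], [1, "I"]];
  let out = "";
  for (const [value, symbol] of table) {
    while (n >= value) { out += symbol; n -= value; }
  }
  return out;
}

// Digits 0..20000: only the four-digit strings should be accepted.
const digitFailures = [];
for (let n = 0; n <= 20000; n++) {
  const expected = String(n).length === 4;
  if (pattern.test(String(n)) !== expected) digitFailures.push(n);
}

// Roman numerals 1..150 followed by "-AA": only 1..33 should be accepted.
const romanFailures = [];
for (let n = 1; n <= 150; n++) {
  const expected = n <= 33;
  if (pattern.test(toRoman(n) + "-AA") !== expected) romanFailures.push(n);
}

console.log(digitFailures.length, romanFailures.length); // 0 0
console.log(pattern.test("-AA")); // true: an empty numeral slips through
```

If the empty numeral should be rejected, a look-ahead such as `(?=[IVX])` before the numeral part is one way to require at least one numeral character.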

By extraneous brackets, I assume you mean the non-capturing groups (?: ... ). I use those a lot to group things (and the grouping is quite necessary here). I made them non-capturing because I do not need to capture the sub-groups, only the parent groups (and in this use case I don't think they need to actually be captured either, but it doesn't hurt to do so). By making them non-capturing, they won't create backreferences which speeds up processing (though for a single input, the time gained is negligible).
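
To make the capturing versus non-capturing distinction concrete, here is a small sketch using two simplified patterns (both are illustrations written for this example, not the pattern from the answer). The match array holds the whole match plus one entry per capturing group, so non-capturing groups keep it short.

```javascript
// Same structure, once with capturing groups and once with (?: ... ).
const capturing = /^(X{0,2})(V(I{0,3}))-([A-Z]{2})$/;
const nonCapturing = /^(?:X{0,2})(?:V(?:I{0,3}))-([A-Z]{2})$/;

const input = "XVII-AB";

// Whole match + 4 captured groups: "XVII-AB", "X", "VII", "II", "AB"
console.log(input.match(capturing).length);    // 5

// Whole match + the single remaining captured group
console.log(input.match(nonCapturing).length); // 2
console.log(input.match(nonCapturing)[1]);     // "AB"
```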

How did you construct that pattern, and what’s with all the extraneous brackets? — tchrist, Feb 18 '12 at 13:26
No, actually, I meant why did you write `[V][I]{0,3}` instead of `VI{0,3}`. Also, you used the wrong comment character: regexes require `#`. Oh wait, this is Javascript, where you are forbidden from using `/x` or `(?x)` mode. Javascript has the worst regexes of any language out there. Just horrible. The XRegExp plugin helps a bit though. — tchrist, Feb 18 '12 at 14:07
I constructed that pattern by hand. Exact method is now in the post. — pete, Feb 18 '12 at 14:23
Oh, because I tend to think in character classes when writing regular expressions. I think you're correct in that it could probably be simpler. I originally wrote those as `[vV][iI]{0,3}` and then added the case-insensitive switch afterwards (and removed the lowercase matches). — pete, Feb 18 '12 at 14:26
