I have this regular expression:

(IX|IV|V?I{0,3}|M{1,4}|CM|CD|D?C{1,3}|XC|XL|L?X{1,3})

I use it to detect whether a text contains a Roman numeral:

eregi("( IX|IV|V?I{0,3}[\.]| M{1,4}[\.]| CM|CD|D?C{1,3}[\.]| XC|XL|L?X{1,3}[\.])", $title, $regs)

The Roman numeral always appears in the form " IV.", with a space before it and a period after it. In the eregi example I added the space before the number and the "." after it, but I still get the same result. If the text is something like "somethinvianyyhing", the result is "vi" (from the middle of the word).

What am I doing wrong?

kapa
M.V.

1 Answer

There is no space before the alternative that matches VI. A space belongs only to the alternative in which it is written, not to all of them. The same goes for the \.: it belongs only to the alternative where it is written.
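The precedence rule can be seen in isolation with a minimal sketch. It uses Python's `re` module rather than the PHP `eregi` call from the question, and the two shortened patterns are illustrative assumptions, not the full expression:

```python
import re

text = "somethinvianyyhing"

# The space is part of the first alternative only, so "VI" may match anywhere.
ungrouped = re.compile(r" IX|VI", re.IGNORECASE)

# Grouping the alternatives makes the space and the period apply to all of them.
grouped = re.compile(r" (IX|VI)\.", re.IGNORECASE)

found_ungrouped = ungrouped.search(text)
found_grouped = grouped.search(text)

print(found_ungrouped.group(0) if found_ungrouped else None)  # vi
print(found_grouped)  # None
```

The alternation operator binds more loosely than concatenation, which is why the parentheses are needed.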

Try this

" (IX|IV|V?I{0,3}|M{1,4}|CM|CD|D?C{1,3}|XC|XL|L?X{1,3})\."

This will match

I.
II.
III.
IV.
V.
VI.
VII.
VIII.
IX.
X.

But not

XI. MMI. MMXI.
somethinvianyyhing
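The behaviour listed above can be checked with a short script. This is a sketch in Python's `re` module, not the original PHP; the sample sentences and the helper name `has_numeral` are made up for illustration, and the match is case-sensitive, unlike `eregi`:

```python
import re

# Pattern from the answer: one space, one alternative, then a literal period.
pattern = re.compile(
    r" (IX|IV|V?I{0,3}|M{1,4}|CM|CD|D?C{1,3}|XC|XL|L?X{1,3})\."
)

def has_numeral(text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None

should_match = ["Chapter IV. Intro", "Part VIII. End", "Book X. Notes"]
should_not_match = ["Part XI. End", "Year MMXI. End", "somethinvianyyhing"]

for sample in should_match + should_not_match:
    print(sample, has_numeral(sample))
```

One caveat: because `V?I{0,3}` can match the empty string, a bare " ." also matches this pattern. In PHP, the same expression can be passed to `preg_match` (with delimiters and the `i` flag), since `eregi` was deprecated in PHP 5.3 and removed in PHP 7.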

Your approach is far from matching Roman numerals correctly. A more accurate pattern, covering only the symbols up to L (50), is this:

^(?:XL|L|L?(?:IX|X{1,3}|X{0,3}(?:IX|IV|V|V?I{1,3})))$

I have tested this only superficially, but you can see that it gets complex quickly, and C, D and M are still missing from this expression.

Not to mention special cases, for example 4 = IV = IIII, and there are more of them.
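For the standard subtractive notation only (1 to 3999, so no IIII), a widely used pattern handles thousands, hundreds, tens and units in turn. The following is a sketch in Python; the function name `is_roman` is hypothetical, and the pattern deliberately rejects the special cases mentioned above:

```python
import re

# Thousands, hundreds, tens, units; each part may be empty on its own.
ROMAN = re.compile(
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)

def is_roman(s):
    """Return True if s is a well-formed Roman numeral from I to MMMCMXCIX."""
    # Every part of the pattern is optional, so reject the empty string first.
    return bool(s) and ROMAN.fullmatch(s) is not None

for sample in ["XLIX", "MMXI", "MMMDCCCLXXXVIII", "IIII", "VX", ""]:
    print(repr(sample), is_roman(sample))
```

This validates a whole string. To find numerals inside running text, the pattern still needs delimiters around it, such as the space and period from the question.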

See the Wikipedia article on Roman numerals.

stema
  • Also on SO: http://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression – kapa Aug 24 '11 at 07:11
  • There is a Perl module that handles Roman numerals correctly. The way to know you have one is to *first* match `/\b([ivxldcm]+)\b/i` and *then* check whether `Roman::isroman($1)` returns true. Otherwise you get wrong answers. It only works on ASCII, which means it only goes up to 4000. The longest such legal string is `MMMDCCCLXXXVIII`. With Unicode, you can go much much higher, since you have the bigger Roman numerals like ↂ for 10,000 *&c*, and you also have the macron or overline for 1000 times the base char. I have a module that handles all these, but it is of course written in Perl.☺ – tchrist Aug 24 '11 at 22:47
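The two-step idea from the comment above (first pick out whole-word candidates, then validate each one) can be sketched in Python. This is not the Perl `Roman` module: the `VALID` pattern is a stand-in validator for ASCII numerals up to 3999, and `find_roman_numerals` is a hypothetical name:

```python
import re

# Step 1: any whole word made only of Roman numeral letters.
CANDIDATE = re.compile(r"\b[ivxldcm]+\b", re.IGNORECASE)

# Step 2: stand-in validator for standard subtractive notation.
VALID = re.compile(
    r"M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})"
)

def find_roman_numerals(text):
    """Return whole-word tokens in text that are well-formed Roman numerals."""
    return [
        token
        for token in CANDIDATE.findall(text)
        if VALID.fullmatch(token.upper())
    ]

sample = "Chapter IV. somethinvianyyhing, vol. XIV and IIII"
print(find_roman_numerals(sample))  # ['IV', 'XIV']
```

Because the candidate match is case-insensitive, ordinary words such as "mix" (MIX = 1009) are also accepted. Dropping `re.IGNORECASE` avoids that if the numerals are always upper case.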