I am trying to write a regex that matches romanian numbers in text. Here is my regex:

^ | \s+[xivXIV]+\s+ | $

So it means: 'start of string or one or more whitespace characters, then one or more of the characters xivXIV, then one or more whitespace characters or end of string.'

But it does not seem to work for me. For example, I have the simple string 'xiv' and it is not matched by this pattern.

EDIT: The suggested post is about checking whether a string literal is a romanian number. Instead, I want to 'smartly' extract those literals from text, so it should handle cases like 'visit', where it should not take 'vi', but for 'ix table of contents' it should take 'ix'.

EDIT 2: Thanks for all the replies. The expression should be:

 \b[xivXIV]+\b

NOTE: in my particular case I only need to handle XIV literals (not full romanian system), which is why I need a simpler solution.

igorGIS
  • Possible duplicate of [How do you match only valid roman numerals with a regular expression?](http://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression) – trincot Sep 18 '16 at 19:05
  • Thanks for the reply. I looked at this post before; it is about how to check whether a string matches a romanian number. My question is how to extract it from a string. For example, the post marked as the answer will not handle the case 'my string IV my string' – igorGIS Sep 18 '16 at 19:09
  • Doesn't something like **(subexpression)** capture the data you need? The round-brackets are meant for capturing, AFAIK. – blackpen Sep 18 '16 at 19:11
  • Yes, but the duplicate reference answers are easy to adapt, since for an exact match there are the `^` and `$` markers. If you remove those, it will match substrings. Word breaks (`\b`) can instead make sure the matches are separate from other text. – trincot Sep 18 '16 at 19:11
  • The edited question is a case of matching the regex at word boundaries (**\b**, that is, an escaped lowercase letter b). See [here](https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx). – blackpen Sep 18 '16 at 19:14
  • Romanians use the Arabic numbers (1,2,3... etc.) – Stilgar Sep 18 '16 at 19:28

1 Answer

You can take the answer from the linked Q&A on matching valid Roman numerals and adapt it so that it matches substrings embedded in other text.

The accepted answer there has this:

^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$

Replace the start/end anchors (^ and $) by word breaks (\b):

\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b
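
As a minimal sketch of how the word-boundary version could be used for extraction, here is a Python example (Python is an assumption; the question does not name a language, and `extract_roman` is a hypothetical helper). One thing to watch for: every part of the pattern is optional, so it can also match the empty string at a word boundary. The sketch filters those empty matches out.

```python
import re

# Pattern from the answer: start/end anchors replaced by word boundaries.
ROMAN = re.compile(
    r"\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b"
)

def extract_roman(text):
    """Return the non-empty Roman-numeral matches found in text."""
    # All parts of the pattern are optional, so it also matches the empty
    # string at word boundaries; those empty matches are dropped here.
    return [m.group(0) for m in ROMAN.finditer(text) if m.group(0)]

print(extract_roman("my string IV my string"))    # ['IV']
print(extract_roman("Chapter XIV, page MCMXCIV")) # ['XIV', 'MCMXCIV']
```

Without a case-insensitive flag this only finds upper-case numerals, so lower-case input such as 'ix' is ignored.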

Note that the simpler \b[xivXIV]+\b which you mentioned in your second question-edit would accept invalid Roman numerals like:

IXI
XXXXX

and would not recognise these valid ones:

CM
LX
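
The difference between the two patterns can be checked with a short Python sketch (again an assumption about language; `accepts` is a hypothetical helper that tests whether a pattern matches a whole, non-empty token):

```python
import re

SIMPLE = re.compile(r"\b[xivXIV]+\b")
STRICT = re.compile(
    r"\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b"
)

def accepts(pattern, token):
    """True if the pattern matches the whole, non-empty token."""
    return bool(token) and pattern.fullmatch(token) is not None

for token in ["IXI", "XXXXX", "CM", "LX"]:
    print(token, accepts(SIMPLE, token), accepts(STRICT, token))
# IXI True False
# XXXXX True False
# CM False True
# LX False True
```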

In a later edit of your question you wrote that you only want "to handle XIV literals (not full romanian[sic] system)". Still you could then take the corresponding part of the above mentioned regular expression to exclude the invalid combinations of those three letters:

\bX{0,3}(IX|IV|V?I{0,3})\b

NB: for case-insensitivity you would add the i modifier.
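
For illustration, here is how the restricted pattern with the case-insensitive flag might look in Python, where the i modifier corresponds to `re.IGNORECASE` (the language choice and the helper name `extract_xiv` are assumptions, not part of the question). Empty matches are filtered out because every part of the pattern is optional.

```python
import re

# Restricted X/I/V pattern from the answer, made case-insensitive.
XIV_ONLY = re.compile(r"\bX{0,3}(IX|IV|V?I{0,3})\b", re.IGNORECASE)

def extract_xiv(text):
    """Return the non-empty matches of the restricted pattern in text."""
    # The pattern can match the empty string at a word boundary,
    # so empty matches are skipped.
    return [m.group(0) for m in XIV_ONLY.finditer(text) if m.group(0)]

print(extract_xiv("visit"))                     # []
print(extract_xiv("ix table of contents"))      # ['ix']
print(extract_xiv("section XIV and part IXI"))  # ['XIV']
```

Note that a standalone "I" (for example the English pronoun) is a valid numeral for this pattern and would be extracted as well, so some post-filtering may be needed depending on the text.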

trincot