I want to find Roman numerals inside a string (numbers below 80 are enough). I found a good base for it in "How do you match only valid roman numerals with a regular expression?". The problem is that it deals with whole strings. I have not yet found a way to detect Roman numerals inside a string, because nothing is mandatory: every group may be optional. So far I have tried something like this:

my $x = ' some text I-LXIII iv more ';

if (  $x =~  s/\b(
                    (
                        (XC|XL|L?X{0,3}) # first group 10-90
                    |
                        (IX|IV|V?I{0,3}) # second group 1-9
                    )+
            )
        \b/>$1</xgi ) { # mark every occurrence
    say $x;
}

__END__
 ><some>< ><text>< ><>I<><-><>LXIII<>< ><>iv<>< ><more>< 
 desired output:
  some text >I<->LXIII< >iv< more 

So this one also matches bare word boundaries, because all groups are optional. How can I get this done? How do I make one of those two groups mandatory when it is not possible to tell which one is mandatory? Other approaches to catching Roman numerals are welcome too.

w.k
  • Generally, to say `a` or `b` or `ab`, but not nothing, you can do `(a|b|ab)` or `(ab?|b)`, but you will not get around duplication. – Martin Ender Oct 18 '12 at 08:15
  • Problem: `a` and `b` themselves each consist of 4 optional blocks. Covering all those combinations seems pretty crazy. – w.k Oct 18 '12 at 08:18
  • Ah right, I see your point. Does Perl support lookaheads? You could add a lookahead to the beginning of the match (after the boundary): `(?=[IVXLDCM])` – Martin Ender Oct 18 '12 at 08:20
  • If I have any word beginning with those letters, those boundaries will be captured too. Or I did it the wrong way; I am not too familiar with lookaheads. Maybe you could try the regex and present your code as an answer too? – w.k Oct 18 '12 at 08:28
  • Sorry, I don't know the first thing about Perl (which is why I posted the suggestion only as a comment). However, looking at your regex again, I think the only problem is that `L?X{0,3}` and its counterpart can be empty. Expand these to `L?X{1,3}|L` and `V?I{1,3}|V`. Then the `+` at the end should make sure that you don't get empty strings between boundaries. – Martin Ender Oct 18 '12 at 08:35
  • `X` or `I` can't be mandatory; setting them to `{1,3}` makes them mandatory. My question is not very Perl-related, so a general regex solution will work too. – w.k Oct 18 '12 at 08:54
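
A minimal sketch of what the comments above point towards, combining the `(ab?|b)` shape from the first comment with the non-empty alternatives `L?X{1,3}|L` and `V?I{1,3}|V`. The names `$tens`, `$units`, `$roman` and `$marked` are illustrative and not from the thread, and the sketch only covers the tens and units groups used in the question.

```perl
use strict;
use warnings;
use feature 'say';

# Non-empty versions of the two groups: each must consume at least one letter.
my $tens  = qr/XC|XL|L?X{1,3}|L/i;   # 10-90
my $units = qr/IX|IV|V?I{1,3}|V/i;   # 1-9

# "tens, optionally followed by units" or "units alone", so never empty.
my $roman = qr/\b(?:(?:$tens)(?:$units)?|(?:$units))\b/;

my $x = ' some text I-LXIII iv more ';
(my $marked = $x) =~ s/($roman)/>$1</g;   # mark every occurrence
say $marked;   # prints " some text >I<->LXIII< >iv< more "
```

Using "tens then optional units, or units alone" instead of a trailing `+` avoids accepting repeated groups such as `IIII` or `IXIX`, at the cost of writing the units alternative twice.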

2 Answers

You can use the Roman module from CPAN:

use feature 'say';   # needed for say()
use Roman;           # CPAN module; provides isroman()

my $x = ' some text I-LXIII VII XCVI IIIXII iv more ';
if (  $x =~  
    s/\b
    (
        [IVXLC]+
    )
    \b
    /isroman($1) ? ">$1<" : $1/exgi ) {
    say $x;
}

output:

some text >I<->LXIII< >VII< >XCVI< IIIXII >iv< more 
Toto

This is where Perl lets us down with its missing \< and \> (beginning and end word boundary) constructs that are available elsewhere. A pattern like \b...\b will match even if the ... consumes none of the target string because the second \b will happily match the beginning word boundary a second time.

However an end word boundary is just (?<=\w)(?!\w) so we can use this instead.
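
To see the difference, here is a small sketch that lists the offsets at which each construct matches in a short string. The helper `offsets` and the sample text are invented for the illustration; `(?<!\w)(?=\w)` is the corresponding beginning-of-word test, added here for symmetry.

```perl
use strict;
use warnings;
use feature 'say';

my $text = 'ab cd';

# Collect the offset of every zero-width match of a pattern in $text.
sub offsets {
    my ($re) = @_;
    my @found;
    push @found, $-[0] while $text =~ /$re/g;
    return @found;
}

my @any   = offsets(qr/\b/);              # start and end of each word
my @start = offsets(qr/(?<!\w)(?=\w)/);   # beginning of a word only
my @end   = offsets(qr/(?<=\w)(?!\w)/);   # end of a word only

say "any:   @any";     # any:   0 2 3 5
say "start: @start";   # start: 0 3
say "end:   @end";     # end:   2 5

# \b(X?)\b succeeds on 'ab' with an empty capture: both \b match at offset 0.
my ($capture) = 'ab' =~ /\b(X?)\b/;
say 'matched nothing' if defined $capture && $capture eq '';
```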

This program will do what you want. It does a look-ahead for a string of potential Roman characters enclosed in word boundaries (so we must be at a beginning word boundary) and then checks for a legal Roman number that isn't followed by a word character (so now we're at an end word boundary).

Note that I've reversed your >...< marks as they were confusing me.

use strict;
use warnings;

use feature 'say';

my $x = ' some text I-LXIII iv more ';

if ( $x =~ s{
    (?= \b [CLXVI]+ \b )
    (
      (?:XC|XL|L?X{0,3})?
      (?:IX|IV|V?I{0,3})?
    )
    (?!\w)
    }
    {<$1>}xgi ) {

    say $x;
}

output

some text <I>-<LXIII> <iv> more 
Borodin
  • In the code you use `(?!\w)` as the end boundary, but earlier you define it as `(?<=\w)(?!\w)`. Is this just a typo, or am I missing something here? – w.k Oct 18 '12 at 13:30
  • @w.k: What we're doing is finding a string of *word* characters made up entirely of the Roman letters, and then making sure it is a valid Roman number. The `(?!\w)` is there to make sure that *all* of this string is a valid Roman number instead of just the first few characters. If we had `LXIC`, for instance, then only `LXI` would be valid, and `(?!\w)` wouldn't match because `C` is a word character. Adding `(?<=\w)` only serves to prevent the boundary between two non-word characters also matching, and that never occurs here – Borodin Oct 19 '12 at 21:11
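
A small sketch to illustrate the comment above: it runs the answer's pattern against single words and records what was captured. The word list and the `%result` hash are made up for the illustration.

```perl
use strict;
use warnings;
use feature 'say';

# The pattern from the answer, stored so it can be reused.
my $roman = qr{
    (?= \b [CLXVI]+ \b )          # a whole word of Roman letters starts here
    (
      (?:XC|XL|L?X{0,3})?
      (?:IX|IV|V?I{0,3})?
    )
    (?!\w)                        # the numeral must use up the whole word
}xi;

my %result;
for my $word (qw( LXI LXIC IIIXII XIV )) {
    $result{$word} = $word =~ $roman ? "matched '$1'" : 'no match';
    say "$word: $result{$word}";
}
# LXI: matched 'LXI'
# LXIC: no match
# IIIXII: no match
# XIV: matched 'XIV'
```

In `LXIC` the capture can only reach `LXI`, and `(?!\w)` then fails on the trailing `C`, so the word is rejected as a whole rather than partially marked.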