The Moses tokenizer is a tokenizer widely used in machine translation and natural language processing experiments.

It has a line of code with several regex checks:

if (($pre =~ /\./ && $pre =~ /\p{IsAlpha}/) || 
   ($NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1) || 
   ($i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/)))

Please correct me if I'm wrong, but I believe the 2nd and 3rd conditions check

  • whether the prefix is in a list of nonbreaking prefixes
  • whether the word is not the last token and there is still a lowercased token as the next word.

The question is on the first condition where it checks for:

($pre =~ /\./ && $pre =~ /\p{IsAlpha}/)
  1. Is the $pre =~ /\./ checking whether the prefix is a single fullstop?

  2. And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?

  3. One related question is whether the fullstop is already inside the perluniprop alphabet? If so, wouldn't this condition never be true?

ikegami
alvas
  • They check if those things are *contained*. No, FULL STOP is not an alphabetic letter. – ikegami Feb 09 '17 at 01:56
  • Ah, now I see. So `$pre =~ /\p{IsAlpha}/` is checking whether all characters in `$pre` are in the perluniprop alphabet, right? – alvas Feb 09 '17 at 02:00
  • No, it checks if $pre *contains* a matching character, so it checks if *any* character in $pre matches. – ikegami Feb 09 '17 at 02:03
  • Thanks @ikegami, that explains! – alvas Feb 09 '17 at 02:04
  • I misspoke when I said alphabetic *letter*. A number of characters are considered alphabetic (match `\p{IsAlpha}`) but aren't letters (match `\p{Letter}`), e.g. TAI VIET VOWEL AM – ikegami Feb 09 '17 at 02:07
  • The `$str =~ /a/` returns true if there is (at least) one `a` anywhere in the string. So it's true for string `'alvas'` but not for `'hi'`. A _character class_ `/[a-z]/` matches any one lowercase letter, at least one. To match more than one thing you need a _quantifier_, like `/[a-z]+/` (matches a lowercase letter one or more times in a row; it need not be the same letter). To test for a sole thing in a string use _anchors_, `/^a$/` (`'a'` only), for example. This is how the _match operator_ works in the [_scalar context_](http://perldoc.perl.org/perldata.html#Context) (and there's more to it). – zdim Feb 09 '17 at 04:51
  • Actually, you'd need `/^a\z/` to test if the string being matched is exactly `a`. – ikegami Feb 09 '17 at 05:06
  • Um, right, sorry. The `$` for a multiline string in a multiline mode (with `/m` [modifier](http://perldoc.perl.org/perlre.html#Modifiers)) isn't right. This is covered in full detail in [this post](http://stackoverflow.com/questions/32526929/difference-between-z-and-z-and-a-and-a-in-perl), and documentation is [in perlre](http://perldoc.perl.org/perlre.html#Regular-Expressions) (scroll down to "_Assertions_") – zdim Feb 09 '17 at 05:53
  • I thought that the question was mostly answered in comments, but this is clearly not so. What do you need in an answer -- the same but written up and explained well? Or is there more that the comments missed? – zdim Feb 14 '17 at 04:57
  • Yes yes, giving the bounty to @ikegami when he writes it up. So that the answer can be documented for posterity =) – alvas Feb 14 '17 at 05:05
  • OK, good cause, thank you for clarifying :) – zdim Feb 14 '17 at 05:28

1 Answer

Please correct me if I'm wrong [about $NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1 checking] whether the prefix is in a list of nonbreaking prefixes

Can't tell without knowing what %NONBREAKING_PREFIX contains, but it's a fair guess.

Please correct me if I'm wrong [about $i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/) checking] whether the word is not the last token and there is still a lowercased token as the next word

Assuming the code is iterating over @words, and $i is the index of the current word, then it checks if the current word is followed by a word that starts with a lowercase letter (as defined by Unicode).

Is the $pre =~ /\./ checking whether the prefix is a single fullstop?

Not quite. It checks if any of the characters in the string in $pre is a FULL STOP.

$ perl -e'CORE::say "abc.def" =~ /\./ ? "match" : "no match"'
match

$ perl -e'CORE::say "abc!def" =~ /\./ ? "match" : "no match"'
no match

Perl first tries to find a match at position 0, then at position 1, etc, until it finds a match.
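
To see that the match is found by scanning rather than by comparing the whole string, here is a minimal sketch that reports where the first FULL STOP was found. The sub name `first_full_stop_offset` is made up for this illustration; `$-[0]` is Perl's built-in offset of the start of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: return the 0-based offset of the first FULL STOP
# in $str, or -1 if the string contains none.
sub first_full_stop_offset {
    my ($str) = @_;
    # $-[0] is only meaningful after a successful match.
    return $str =~ /\./ ? $-[0] : -1;
}

print first_full_stop_offset("abc.def"), "\n";   # prints 3
print first_full_stop_offset("abc!def"), "\n";   # prints -1
```

The pattern is not anchored, so a FULL STOP anywhere in the string is enough for the match to succeed.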

And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?

\p{IsAlpha} is indeed defined in perluniprops. [Note the correct spelling.] It defines

\p{Is_*}          ⇒   \p{*}
\p{Alpha}         ⇒   \p{XPosixAlpha}
\p{XPosixAlpha}   ⇒   \p{Alphabetic=Y}

\p{Alpha: *}      ⇒   \p{Alphabetic=*}
\p{Alphabetic}    ⇒   \p{Alphabetic=Y}

so \p{IsAlpha} is an alias for \p{Alphabetic=Y}[1]. Unicode defines what characters are Alphabetic[2]. There are quite a few:

$ unichars '\p{Alpha}' | wc -l
10391

So back to the question. $pre =~ /\p{IsAlpha}/ checks if any of the characters in the string in $pre is an alphabetic character.
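
The difference between "any character matches" and "all characters match" can be sketched as follows. The sub names `has_alpha` and `all_alpha` are made up for this illustration, and the anchored pattern is only there for contrast; it is not what the code in the question does.

```perl
use strict;
use warnings;

# Hypothetical helper: true if at least one character is alphabetic.
# This is the same test as the second half of the first condition.
sub has_alpha {
    my ($s) = @_;
    return $s =~ /\p{IsAlpha}/ ? 1 : 0;
}

# Hypothetical helper, for contrast only: true if the string is non-empty
# and every character is alphabetic. Needs anchors and a quantifier.
sub all_alpha {
    my ($s) = @_;
    return $s =~ /^\p{IsAlpha}+\z/ ? 1 : 0;
}

for my $s ("eg", "e.g", "...") {
    printf "%-4s any=%d all=%d\n", $s, has_alpha($s), all_alpha($s);
}
```

For "e.g" the unanchored test succeeds while the anchored one fails, because the FULL STOP is not alphabetic.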

One related question is whether the fullstop is already inside the perluniprop alphabet?

No.

$ perl -e'CORE::say "." =~ /\p{IsAlpha}/ ? "match" : "no match"'
no match

$ uniprops .
U+002E <.> \N{FULL STOP}
    \pP \p{Po}
    All Any ASCII Assigned Basic_Latin Punct Is_Punctuation Case_Ignorable CI Common Zyyy Po P
       Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Punctuation Pat_Syn Pattern_Syntax
       PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print X_POSIX_Print Punctuation STerm Term
       Terminal_Punctuation Unicode X_POSIX_Punct

In contrast,

$ uniprops a
U+0061 <a> \N{LATIN SMALL LETTER A}
    \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
       ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
       Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
       Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
       IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
       POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
       X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS

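If so, wouldn't this condition never be true?

No, it can be true. The two matches are independent and don't have to match the same character, so the condition holds whenever $pre contains both a FULL STOP and an alphabetic character.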

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a
no match

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' .
no match

$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a.
match
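
Putting the three checks together, the whole condition from the question can be sketched as a standalone sub. This is only an illustration under assumptions: the sub name `condition_holds` and the contents of `%NONBREAKING_PREFIX` are made up here (the real hash is filled elsewhere in the tokenizer), and I'm assuming `$pre` is the word without its trailing period.

```perl
use strict;
use warnings;

# Hypothetical contents; the real list is not shown in the question.
my %NONBREAKING_PREFIX = (Mr => 1, Dr => 1);

# Hypothetical wrapper around the condition from the question.
# $pre   : the current word without its trailing period (assumption)
# $i     : index of the current word in @words
# @words : all the words being iterated over
sub condition_holds {
    my ($pre, $i, @words) = @_;
    # 1st check: $pre contains a FULL STOP and an alphabetic character.
    return 1 if $pre =~ /\./ && $pre =~ /\p{IsAlpha}/;
    # 2nd check: $pre is listed with the value 1.
    return 1 if $NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre} == 1;
    # 3rd check: not the last word, and the next word starts lowercase.
    return 1 if $i < scalar(@words) - 1 && $words[$i + 1] =~ /^[\p{IsLower}]/;
    return 0;
}

print condition_holds("e.g", 0, "e.g.", "Paris"), "\n";   # 1st check
print condition_holds("Mr",  0, "Mr.",  "Smith"), "\n";   # 2nd check
print condition_holds("etc", 0, "etc.", "and"),   "\n";   # 3rd check
print condition_holds("end", 1, "The",  "end."),  "\n";   # none apply
```

The first three calls print 1 and the last prints 0, since "end" has no inner FULL STOP, is not in the hypothetical list, and is the last word.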

  1. Underscores and spaces are ignored, so \p{IsAlpha}, \p{Is_Alpha} and \p{I s_A l p_h_a} are all equivalent.

  2. The list of alphabetic characters is slightly different than the list of letter characters.

    $ unichars '\p{Letter}' | wc -l
    9540
    
    $ unichars '\p{Alpha}' | wc -l
    10391
    

    All letters are alphabetic, but so are some alphabetic marks, roman numerals, etc.
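
A roman numeral makes the difference concrete. In this minimal sketch (the sub names `is_alpha` and `is_letter` are made up for illustration), U+2160 ROMAN NUMERAL ONE is a letter number, so it is alphabetic without being a letter.

```perl
use strict;
use warnings;

# Hypothetical helpers wrapping the two property tests.
sub is_alpha  { my ($ch) = @_; return $ch =~ /\p{Alpha}/  ? 1 : 0 }
sub is_letter { my ($ch) = @_; return $ch =~ /\p{Letter}/ ? 1 : 0 }

my @samples = (
    [ "LATIN SMALL LETTER A", "a"         ],
    [ "ROMAN NUMERAL ONE",    "\x{2160}"  ],
    [ "FULL STOP",            "."         ],
);

for my $sample (@samples) {
    my ($name, $ch) = @$sample;
    printf "%-22s Alpha=%d Letter=%d\n", $name, is_alpha($ch), is_letter($ch);
}
```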

ikegami