Please correct me if I'm wrong [about $NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre}==1 checking] whether the prefix is in a list of nonbreaking prefixes
Can't tell without knowing what %NONBREAKING_PREFIX contains, but it's a fair guess.
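As a minimal sketch of what that condition does, assume %NONBREAKING_PREFIX maps prefixes to small integer codes. The sample entries and the helper name is_nonbreaking_1 are hypothetical; the real hash is not shown in the question.

```perl
use strict;
use warnings;

# Hypothetical contents; the real %NONBREAKING_PREFIX is not shown.
my %NONBREAKING_PREFIX = (
    'Mr' => 1,
    'Dr' => 1,
    'No' => 2,    # present, but with a value other than 1
);

# True only if the key has a true value AND that value equals 1.
# The first operand short-circuits, so a missing key never reaches
# the numeric comparison (and so never warns about undef).
sub is_nonbreaking_1 {
    my ($pre) = @_;
    return ( $NONBREAKING_PREFIX{$pre} && $NONBREAKING_PREFIX{$pre} == 1 ) ? 1 : 0;
}

print "$_: ", is_nonbreaking_1($_), "\n" for qw( Mr No Xyz );
```

Under these assumptions the condition is narrower than "is in the list": a prefix stored with any value other than 1 fails it.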
Please correct me if I'm wrong [about $i<scalar(@words)-1 && ($words[$i+1] =~ /^[\p{IsLower}]/) checking] whether the word is not the last token and there is still a lowercased token as the next word
Assuming the code is iterating over @words, and $i is the index of the current word, then it checks if the current word is followed by a word that starts with a lowercase letter (as defined by Unicode).
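A small sketch of that check, run over a made-up word list. The sample words and the helper name followed_by_lowercase are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical token list; the real @words is not shown in the question.
my @words = ( 'e.g.', 'this', 'End.', 'Next', 'last' );

# True if element $i is not the last one and the element after it
# starts with a lowercase letter. The index test comes first, so the
# regex is never applied to an element past the end of the array.
sub followed_by_lowercase {
    my ( $list, $i ) = @_;
    return ( $i < scalar(@$list) - 1 && $list->[ $i + 1 ] =~ /^[\p{IsLower}]/ ) ? 1 : 0;
}

for my $i ( 0 .. $#words ) {
    printf "%-5s %d\n", $words[$i], followed_by_lowercase( \@words, $i );
}
```

Only 'e.g.' (followed by 'this') and 'Next' (followed by 'last') satisfy the condition here. The final element never does, because it has no successor.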
Is the $pre =~ /\./ checking whether the prefix is a single fullstop?
Not quite. It checks if any of the characters in the string in $pre is a FULL STOP.
$ perl -e'CORE::say "abc.def" =~ /\./ ? "match" : "no match"'
match
$ perl -e'CORE::say "abc!def" =~ /\./ ? "match" : "no match"'
no match
Perl first tries to find a match at position 0, then at position 1, and so on, until it finds a match or runs out of positions to try.
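For comparison, testing whether a string is exactly one full stop needs anchors (or a plain string comparison such as $pre eq '.'). The sketch below contrasts the two; the helper names are hypothetical.

```perl
use strict;
use warnings;

# Unanchored: true if a full stop occurs anywhere in the string.
sub contains_full_stop {
    my ($s) = @_;
    return $s =~ /\./ ? 1 : 0;
}

# Anchored at both ends: true only if the whole string is one full stop.
sub is_single_full_stop {
    my ($s) = @_;
    return $s =~ /^\.\z/ ? 1 : 0;
}

for my $s ( 'abc.def', '.', 'abc' ) {
    printf "%-8s contains=%d single=%d\n",
        $s, contains_full_stop($s), is_single_full_stop($s);
}
```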
And is $pre =~ /\p{IsAlpha}/ checking whether the prefix is an alpha from the list of alphabet in the perluniprop?
\p{IsAlpha} is indeed defined in perluniprops. [Note the correct spelling.] It defines
\p{Is_*} ⇒ \p{*}
\p{Alpha} ⇒ \p{XPosixAlpha}
\p{XPosixAlpha} ⇒ \p{Alphabetic=Y}
\p{Alpha: *} ⇒ \p{Alphabetic=*}
\p{Alphabetic} ⇒ \p{Alphabetic=Y}
so \p{IsAlpha} is an alias for \p{Alphabetic=Y}[1]. Unicode defines what characters are Alphabetic[2]. There are quite a few:
$ unichars '\p{Alpha}' | wc -l
10391
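The alias chain listed above can be spot-checked by confirming that several spellings give the same answer for a few sample characters. This is only a sketch: the sample characters and the helper name distinct_answers are chosen here for illustration, and agreement on a handful of characters does not prove the aliases are identical.

```perl
use strict;
use warnings;

# Four spellings that the alias chain resolves to the same property.
my %spelling = (
    'IsAlpha'    => qr/\p{IsAlpha}/,
    'Is_Alpha'   => qr/\p{Is_Alpha}/,
    'Alpha'      => qr/\p{Alpha}/,
    'Alphabetic' => qr/\p{Alphabetic}/,
);

# Number of distinct yes/no answers the spellings give for one character.
sub distinct_answers {
    my ($ch) = @_;
    my %seen;
    $seen{ $ch =~ $_ ? 1 : 0 } = 1 for values %spelling;
    return scalar keys %seen;
}

# "a", FULL STOP, and U+2160 ROMAN NUMERAL ONE as sample inputs.
for my $ch ( "a", ".", "\x{2160}" ) {
    printf "U+%04X %s\n", ord($ch), distinct_answers($ch) == 1 ? "agree" : "differ";
}
```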
So back to the question. $pre =~ /\p{IsAlpha}/ checks if any of the characters in the string in $pre is an alphabetic character.
One related question is whether the fullstop is already inside the perluniprop alphabet?
No.
$ perl -e'CORE::say "." =~ /\p{IsAlpha}/ ? "match" : "no match"'
no match
$ uniprops .
U+002E <.> \N{FULL STOP}
\pP \p{Po}
All Any ASCII Assigned Basic_Latin Punct Is_Punctuation Case_Ignorable CI Common Zyyy Po P
Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Other_Punctuation Pat_Syn Pattern_Syntax
PatSyn POSIX_Graph POSIX_Print POSIX_Punct Print X_POSIX_Print Punctuation STerm Term
Terminal_Punctuation Unicode X_POSIX_Punct
In contrast,
$ uniprops a
U+0061 <a> \N{LATIN SMALL LETTER A}
\w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
AHex POSIX_XDigit All Alnum X_POSIX_Alnum Alpha X_POSIX_Alpha Alphabetic Any ASCII
ASCII_Hex_Digit Assigned Basic_Latin ID_Continue Is_IDC Cased Cased_Letter LC
Changes_When_Casemapped CWCM Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
Gr_Base Grapheme_Base Graph X_POSIX_Graph GrBase Hex X_POSIX_XDigit Hex_Digit IDC ID_Start
IDS Letter L_ Latin Latn Lowercase_Letter Lower X_POSIX_Lower Lowercase PerlWord POSIX_Word
POSIX_Alnum POSIX_Alpha POSIX_Graph POSIX_Lower POSIX_Print Print X_POSIX_Print Unicode Word
X_POSIX_Word XDigit XID_Continue XIDC XID_Start XIDS
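If so, wouldn't this condition never be true?
No. The two matches are independent of each other, so the full stop and the alphabetic character can be different characters of $pre: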
$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a
no match
$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' .
no match
$ perl -E'CORE::say /\./ && /\p{IsAlpha}/ ? "match" : "no match" for $ARGV[0]' a.
match
[1] Underscores and spaces are ignored, so \p{IsAlpha}, \p{Is_Alpha} and \p{I s_A l p_h_a} are all equivalent.
[2] The list of alphabetic characters is slightly different from the list of letter characters.
$ unichars '\p{Letter}' | wc -l
9540
$ unichars '\p{Alpha}' | wc -l
10391
All letters are alphabetic, but so are some alphabetic marks, Roman numerals, etc.
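As a sketch of that difference, the following classifies a few individual characters by both properties. The sample characters were picked for this illustration: U+2160 ROMAN NUMERAL ONE is a letter number (Nl) and U+0345 COMBINING GREEK YPOGEGRAMMENI is a mark, and both are Alphabetic without being letters. The helper name classify is hypothetical.

```perl
use strict;
use warnings;

# Report which of the two properties a single character has.
sub classify {
    my ($ch) = @_;
    return join ',',
        ( $ch =~ /\p{Letter}/ ? 'Letter' : () ),
        ( $ch =~ /\p{Alpha}/  ? 'Alpha'  : () );
}

# Print code points rather than the characters themselves, so no
# output encoding layer is needed.
printf "U+%04X %s\n", ord($_), classify($_)
    for "a", "\x{2160}", "\x{0345}", ".";
```

Here "a" is reported as both Letter and Alpha, the Roman numeral and the combining mark as Alpha only, and the full stop as neither.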