Utf8 correct regex for CamelCase (WikiWord) in perl

Question

There was an earlier question about a CamelCase regex. Combined with tchrist's post, I'm wondering what the correct UTF-8 CamelCase regex is.

Starting with (brian d foy's) regex:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word boundary
/x

and modifying to:

/
    \b          # start at word boundary
    \p{Uppercase_Letter}     # start with upper
    \p{Alphabetic}*          # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter}   ### next bit is lower, any zero or more, ending with upper
          |                  # or 
       \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter}   ### next bit is upper, any zero or more, ending with lower
    )

    \p{Alphabetic}*          # anything that's left
    \b          # end at word boundary
/x

I have a problem with the lines marked '###'.

In addition, how should the regex be modified if numbers and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?

Updated (after ysth's comment):

In what follows, "any" means "uppercase or lowercase or number or underscore".

The regex should match words such as CamelWord and CaW:

start with uppercase letter
optional any
lowercase letter or number or underscore
optional any
upper case letter
optional any
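
One possible reading of the list above, as a minimal sketch rather than a definitive answer. It assumes that "uppercase" and "lowercase" mean `\p{Upper}` and `\p{Lower}`, and that "number" means a decimal digit (`\p{Nd}`). The helper name `is_camel_word` is hypothetical.

```perl
use strict;
use warnings;
use utf8;    # the word list below contains literal non-ASCII characters

# "any" from the list above: uppercase, lowercase, number, or underscore.
# Assumption: "number" is taken to mean a decimal digit (\p{Nd}).
my $any = qr/[\p{Upper}\p{Lower}\p{Nd}_]/;

# Numbers and the underscore are treated like lowercase letters.
my $lowerish = qr/[\p{Lower}\p{Nd}_]/;

# Hypothetical helper: true if the whole string fits the six-step definition.
sub is_camel_word {
    my ($word) = @_;
    return $word =~ /\A \p{Upper} $any* $lowerish $any* \p{Upper} $any* \z/x ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $word (qw(CamelWord CaW W2X3 ÉcoleÉté Camel camelWord HTML)) {
    printf "%-10s %s\n", $word, is_camel_word($word) ? 'match' : 'no match';
}
```

Under these assumptions the first four words match and the last three do not: `Camel` and `HTML` lack the "lowercase-like, then uppercase" sequence, and `camelWord` does not start with an uppercase code point.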

Please do not mark this as a duplicate, because it is not. The original question (and its answers) considered only ASCII.

That is a really bizarre regex that you've started with; I don't think it matches anything differently than the simpler `/\b[A-Z]+[a-z][A-Za-z]*\b/` (a "word" composed only of letters, starting with a capital letter and including at least one lower case letter) (update: I'm wrong, the original regex required at least three letters.) — ysth, Jun 12 '11 at 16:25
in any case, please don't start with an ASCII regex; start with as precise as possible a definition of what you want to match — ysth, Jun 12 '11 at 16:29
updated the question - with (i hope enough) precise definition — clt60, Jun 12 '11 at 17:02
Nit: You say UTF-8 when you mean Unicode. UTF-8 is a way of storing text into bytes, but your regex is clearly meant to work on text. — ikegami, Jun 12 '11 at 21:29
That's not really my regex. [j_random_hacker came up with that](http://stackoverflow.com/questions/815787/what-perl-regex-can-match-camelcase-words/816598#816598), although I later modified it with the /x switch. — brian d foy, Jun 18 '11 at 17:39

score 5 · Accepted Answer · answered Jun 12 '11 at 18:19

I really can’t tell what you’re trying to do, but this should be closer to what your original intent seems to have been. I still can’t tell what you mean to do with it, though.

m{
    \b
    \p{Upper}      #  start with uppercase code point (NOT LETTER)

    \w*            #  optional ident chars 

    # note that upper and lower are not related to letters
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )

    \w*

    \b
}x
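
To see what the pattern above accepts, here is a small usage sketch. The sample text and the helper name `camel_words` are made up for illustration; the pattern itself is copied unchanged.

```perl
use strict;
use warnings;
use utf8;    # the sample text contains literal non-ASCII characters

# The pattern from above, compiled once with qr//.
my $camel = qr{
    \b
    \p{Upper}
    \w*
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Hypothetical helper: return every match found in a string (list context).
sub camel_words {
    my ($text) = @_;
    return $text =~ /($camel)/g;
}

binmode STDOUT, ':encoding(UTF-8)';
my $text = 'CamelCase plain ÉcoleÉté HTML Wiki_Word camelCase W2X3';
print "$_\n" for camel_words($text);
```

This should print `CamelCase`, `ÉcoleÉté`, and `Wiki_Word`. `HTML` and `W2X3` are skipped because the pattern requires at least one `\p{Lower}` code point, and digits are not lowercase, so treating numbers and the underscore as lowercase would need a character class such as `[\p{Lower}\p{Nd}_]` in place of `\p{Lower}`.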

Never use [a-z]. And in fact, don’t use \p{Lowercase_Letter} or \p{Ll}, since those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
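
The difference is easy to check on a few code points whose General_Category is not `Ll` or `Lu` but which still carry the Lowercase or Uppercase property. The following is a small sketch; the sample code points are my own choice, and `case_props` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: list which of the four properties a character has.
sub case_props {
    my ($ch) = @_;
    my @props;
    push @props, 'Lower' if $ch =~ /\p{Lower}/;
    push @props, 'Ll'    if $ch =~ /\p{Ll}/;
    push @props, 'Upper' if $ch =~ /\p{Upper}/;
    push @props, 'Lu'    if $ch =~ /\p{Lu}/;
    return "@props";
}

my @samples = (
    [ "a",        'LATIN SMALL LETTER A (gc=Ll)' ],
    [ "\x{2170}", 'SMALL ROMAN NUMERAL ONE (gc=Nl)' ],
    [ "\x{24D0}", 'CIRCLED LATIN SMALL LETTER A (gc=So)' ],
    [ "\x{02B0}", 'MODIFIER LETTER SMALL H (gc=Lm)' ],
    [ "A",        'LATIN CAPITAL LETTER A (gc=Lu)' ],
    [ "\x{2160}", 'ROMAN NUMERAL ONE (gc=Nl)' ],
    [ "\x{24B6}", 'CIRCLED LATIN CAPITAL LETTER A (gc=So)' ],
);

for my $sample (@samples) {
    my ($ch, $name) = @$sample;
    printf "U+%04X %-40s %s\n", ord($ch), $name, case_props($ch);
}
```

The plain letters report both properties (`Lower Ll`, `Upper Lu`), while the Roman numerals, circled letters, and modifier letter report only `Lower` or `Upper`. A pattern built on `\p{Ll}` and `\p{Lu}` would therefore miss them.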

And remember that \w is really just an alias for

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
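
A quick way to convince yourself of this is to test one code point from each of those classes against `\w`. This is only a sketch: the samples are my own picks, all above U+00FF or plain ASCII so that the result does not depend on how Perl treats Latin-1 strings, and `is_word_char` is a hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character matches \w.
sub is_word_char {
    my ($ch) = @_;
    return $ch =~ /\A\w\z/ ? 1 : 0;
}

my @samples = (
    [ "\x{03B1}", 'GREEK SMALL LETTER ALPHA (Alphabetic)' ],
    [ "\x{0301}", 'COMBINING ACUTE ACCENT (Mark)' ],
    [ "\x{0663}", 'ARABIC-INDIC DIGIT THREE (Decimal_Number)' ],
    [ "\x{2167}", 'ROMAN NUMERAL EIGHT (Letter_Number)' ],
    [ "_",        'LOW LINE (Connector_Punctuation)' ],
    [ "-",        'HYPHEN-MINUS (none of these)' ],
);

for my $sample (@samples) {
    my ($ch, $name) = @$sample;
    printf "U+%04X %-45s %s\n", ord($ch), $name,
        is_word_char($ch) ? '\w' : 'not \w';
}
```

Everything except the hyphen should be reported as `\w`, which is why `\w*` in the pattern above also swallows digits, underscores, and combining marks.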

answered Jun 12 '11 at 18:19

tchrist

Why is `Lowercase` and `Lower` more desirable? (i.e. what do they include that `Ll` doesn't?) What's the difference between `Lowercase` and `Lower` (if any)? – ikegami Jun 12 '11 at 21:32
@ikegami: `Lowercase` and `Lower` are the same, being the union of `GC=Lowercase_Letter` with `Other_Lowercase=True`. There are 201 code points that are either ❶ `Lower` *but not* `GC=Ll`, or else ❷ `Upper` *but not* `GC=Lu`. These include `GC=Mn`, `GC=Lm`, `GC=Nl`, and `GC=So` code points. ***Sorry, I’d honestly thought this was all common knowledge by now!*** Run `unichars -gs '/(?= \P{Ll} ) \p{Lower} /x || / (?= \P{Lu} ) \p{Upper} /x' | ucsort --upper-before-lower | cat -n | less -r` to see what I mean. Those programs are in my [unicode toolchest](http://training.perl.com/scripts/). – tchrist Jun 12 '11 at 23:36
@tchrist - the link to unicode toolset is dead (at least now). Any replacement? – clt60 May 15 '14 at 15:09
