
There was an earlier question about a CamelCase regex. Combining it with tchrist's post, I'm wondering what the correct UTF-8 CamelCase regex would be.

Starting with (brian d foy's) regex:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word boundary
/x

and modifying it to:

/
    \b          # start at word boundary
    \p{Uppercase_Letter}     # start with upper
    \p{Alphabetic}*          # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       \p{Lowercase_Letter}[a-zA-Z]*\p{Uppercase_Letter}   ### next bit is lower, any zero or more, ending with upper
          |                  # or 
       \p{Uppercase_Letter}[a-zA-Z]*\p{Lowercase_Letter}   ### next bit is upper, any zero or more, ending with lower
    )

    \p{Alphabetic}*          # anything that's left
    \b          # end at word boundary
/x

I have a problem with the lines marked '###'.

In addition, how should the regex be modified if digits and the underscore are treated as equivalent to lowercase letters, so that W2X3 is a valid CamelCase word?

Update (following ysth's comment):

In what follows,

  • any: means "uppercase or lowercase or number or underscore"

The regex should match words such as CamelWord and CaW, built as follows:

  • start with uppercase letter
  • optional any
  • lowercase letter or number or underscore
  • optional any
  • uppercase letter
  • optional any

Please do not mark this as a duplicate, because it is not. The original question (and its answers) considered only ASCII.

clt60
  • That is a really bizarre regex that you've started with; I don't think it matches anything differently than the simpler `/\b[A-Z]+[a-z][A-Za-z]*\b/` (a "word" composed only of letters, starting with a capital letter and including at least one lower case letter) (update: I'm wrong, the original regex required at least three letters.) – ysth Jun 12 '11 at 16:25
  • in any case, please don't start with an ASCII regex; start with as precise as possible a definition of what you want to match – ysth Jun 12 '11 at 16:29
  • updated the question - with (i hope enough) precise definition – clt60 Jun 12 '11 at 17:02
  • Nit: You say UTF-8 when you mean Unicode. UTF-8 is a way of storing text into bytes, but your regex is clearly meant to work on text. – ikegami Jun 12 '11 at 21:29
  • That's not really my regex. [j_random_hacker came up with that](http://stackoverflow.com/questions/815787/what-perl-regex-can-match-camelcase-words/816598#816598), although I later modified it with the /x switch. – brian d foy Jun 18 '11 at 17:39

1 Answer

I really can’t tell what you’re trying to do, but this should be closer to what your original intent seems to have been. I still can’t tell what you mean to do with it, though.

m{
    \b
    \p{Upper}      #  start with uppercase code point (NOT LETTER)

    \w*            #  optional ident chars 

    # note that upper and lower are not related to letters
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )

    \w*

    \b
}x
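To see what the pattern above accepts, it can be run over a sample sentence. This is just a usage sketch; the variable names and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# The pattern from the answer, stored in a qr// object.
my $camel = qr{
    \b
    \p{Upper}
    \w*
    (?:  \p{Lower} \w* \p{Upper}
      |  \p{Upper} \w* \p{Lower}
    )
    \w*
    \b
}x;

# Collect every whole word in the text that the pattern matches.
my $text  = "CamelCase and CaW match, but Camel, camel and W2X3 do not";
my @found = $text =~ /($camel)/g;

print join(", ", @found), "\n";
```

Note that W2X3 is not matched here: digits are allowed by `\w`, but the pattern still requires a real `\p{Lower}` code point somewhere after the first character, so the "digits count as lowercase" rule from the question would need its own character class.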

Never use [a-z]. And in fact, don’t use \p{Lowercase_Letter} or \p{Ll}, since those are not the same as the more desirable and more correct \p{Lowercase} and \p{Lower}.
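A quick way to see the difference is to test a few code points whose case comes from `Other_Lowercase` or `Other_Uppercase` rather than from the general category. The sample code points below are my own picks for illustration, not a complete list.

```perl
use strict;
use warnings;

# Code points that are cased without being Ll or Lu.
my @samples = (
    0x02B0,  # MODIFIER LETTER SMALL H        (GC=Lm)
    0x2170,  # SMALL ROMAN NUMERAL ONE        (GC=Nl)
    0x2160,  # ROMAN NUMERAL ONE              (GC=Nl)
    0x24B6,  # CIRCLED LATIN CAPITAL LETTER A (GC=So)
);

# Print, for each code point, which of the four properties it has.
for my $cp (@samples) {
    my $ch = chr $cp;
    printf "U+%04X  Lower=%d Ll=%d  Upper=%d Lu=%d\n", $cp,
        ($ch =~ /\p{Lower}/ ? 1 : 0), ($ch =~ /\p{Ll}/ ? 1 : 0),
        ($ch =~ /\p{Upper}/ ? 1 : 0), ($ch =~ /\p{Lu}/ ? 1 : 0);
}
```

Each of these reports `Lower` or `Upper` as 1 while `Ll` and `Lu` stay 0, which is exactly the set a `\p{Lowercase_Letter}` or `\p{Uppercase_Letter}` pattern would miss.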

And remember that \w is really just an alias for

[\p{Alphabetic}\p{Mark}\p{Decimal_Number}\p{Letter_Number}\p{Connector_Punctuation}]
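As a small check of that claim, here is a sketch that tests `\w` against one code point from a few of those classes. The `%samples` hash and its labels are made up for illustration.

```perl
use strict;
use warnings;

# One representative code point per class, plus a non-word character.
my %samples = (
    "Mark (U+0301 COMBINING ACUTE ACCENT)"    => 0x0301,
    "Connector_Punctuation (U+203F UNDERTIE)" => 0x203F,
    "Letter_Number (U+2160 ROMAN NUMERAL ONE)" => 0x2160,
    "Decimal_Number (U+0663 ARABIC-INDIC THREE)" => 0x0663,
    "not a word char (U+2010 HYPHEN)"         => 0x2010,
);

# Report whether \w accepts each sample.
for my $label (sort keys %samples) {
    my $ch = chr $samples{$label};
    printf "%-45s %s\n", $label, $ch =~ /\A\w\z/ ? "matches \\w" : "no match";
}
```

This is why `\w*` in the pattern above also swallows combining marks, digits, and connector punctuation such as the underscore.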
tchrist
  • Why is `Lowercase` and `Lower` more desirable? (i.e. what do they include that `Ll` doesn't?) What's the difference between `Lowercase` and `Lower` (if any)? – ikegami Jun 12 '11 at 21:32
  • @ikegami: `Lowercase` and `Lower` are the same, being the union of `GC=Lowercase_Letter` with `Other_Lowercase=True`. There are 201 code points that are either ❶ `Lower` *but not* `GC=Ll`, or else ❷ `Upper` *but not* `GC=Lu`. These include `GC=Mn`, `GC=Lm`, `GC=Nl`, and `GC=So` code points. ***Sorry, I’d honestly thought this was all common knowledge by now!*** Run `unichars -gs '/(?= \P{Ll} ) \p{Lower} /x || / (?= \P{Lu} ) \p{Upper} /x' | ucsort --upper-before-lower | cat -n | less -r` to see what I mean. Those programs are in my [unicode toolchest](http://training.perl.com/scripts/). – tchrist Jun 12 '11 at 23:36
  • @tchrist - the link to unicode toolset is dead (at least now). Any replacement? – clt60 May 15 '14 at 15:09