11

I am searching for the following words in .todo files:

ZshTabCompletionBackward 
MacTerminalIterm

I wrote the following regex:

[A-Z]{1}[a-z]*[A-Z]{1}[a-z]*

However, it is not enough, since it finds only words of the following type:

ZshTab

In pseudocode, I am trying to write the following regex:

([A-Z]{1}[a-z]*[A-Z]{1}[a-z]*){1-9}

How can I write the above regex in Perl?

brian d foy
Léo Léopold Hertz 준영

4 Answers

23

I think you want something like this, written with the /x flag to add comments and insignificant whitespace:

/
   \b      # word boundary so you don't start in the middle of a word

   (          # open grouping
      [A-Z]      # initial uppercase
      [a-z]*     # any number of lowercase letters
   )          # end grouping

   {2,}    # quantifier: at least 2 instances, unbounded max  

   \b      # word boundary
/x

If you want it without the fancy formatting, just remove the whitespace and comments:

/\b([A-Z][a-z]*){2,}\b/
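To see what the compact form picks up, here is a minimal sketch run against a made-up line of text (the sample line is an assumption; only the two compound words come from the question). It swaps the capturing group for a non-capturing `(?:...)` one, because with `/g` in list context a capturing group would return only the last captured hump instead of the whole word.

```perl
use strict;
use warnings;

# Hypothetical sample line; only the two compound words are from the question.
my $line = 'Fix ZshTabCompletionBackward in MacTerminalIterm, see Perl docs, XXX';

# Same pattern as above, but with a non-capturing group so that a
# list-context match with /g returns each whole matched word.
my @words = $line =~ /\b(?:[A-Z][a-z]*){2,}\b/g;

print "$_\n" for @words;
```

On this input, the single-capital words "Fix" and "Perl" are skipped, while the all-caps "XXX" is reported along with the two compound words.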

As j_random_hacker points out, this is a bit simple since it will match a word that is just consecutive capital letters. His solution, which I've expanded with /x to show some detail, ensures at least one lowercase letter:

/
    \b          # start at word boundary
    [A-Z]       # start with upper
    [a-zA-Z]*   # followed by any alpha

    (?:  # non-capturing grouping for alternation precedence
       [a-z][a-zA-Z]*[A-Z]   # next bit is lower, any zero or more, ending with upper
          |                     # or 
       [A-Z][a-zA-Z]*[a-z]   # next bit is upper, any zero or more, ending with lower
    )

    [a-zA-Z]*   # anything that's left
    \b          # end at word boundary
/x

If you want it without the fancy formatting, just remove the whitespace and comments:

/\b[A-Z][a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/

I explain all of these features in Learning Perl.

brian d foy
  • Isn't a single capitalized word (such as Perl or Boing) also a valid CamelCase word? In that case, the quantifier should be {1,} or simply + – Barry Brown May 02 '09 at 23:16
  • @Barry: In many cases, it would cause more problems than it solves. I like Brian's versions. @Brian: What does the /x flag mean, which you do not use in your last command? – Léo Léopold Hertz 준영 May 03 '09 at 00:08
  • Perl or Boing are not camel-cased because they are not compound words. – brian d foy May 03 '09 at 00:27
  • You guys need to be more careful when you talk about something being camel case: do you mean ArabianCamelCase (also known as DromedaryCase, one word okay) or BactrianCamelCase (multiple words)? – Anon Gordon May 03 '09 at 00:42
  • Not to mention AliceTheCamelCase (also known as lowercase). – Anon Gordon May 03 '09 at 00:42
  • What about the third form, smallFirstLetter case? Isn't that also camel case? After all, no matter what kind of camel, the hump(s) are always in the middle, not at the ends. – AmbroseChapel May 03 '09 at 01:16
  • @Ambrose: That's what I know camel case as. – Bill Lynch May 03 '09 at 01:22
  • Note that this regex will also pick up words that consist of all capitals (depending on your precise definition of camel case, these words may or may not be considered camel cased). If you want to restrict to just camel cased words containing at least one lowercase letter, use: /\b([A-Z][a-z]*)+[A-Z][a-z]+([A-Z][a-z]*)*\b/ – j_random_hacker May 03 '09 at 08:23
  • Yeah, consecutive capital letters is a definition problem. If I were going over source code, I'd pick up those XXX I litter everywhere. – brian d foy May 03 '09 at 11:58
  • I think somebody needs to make a Regexp::Common module to handle these cases. – Kent Fredric May 03 '09 at 17:28
  • @briandfoy While the first solution is understandable, the second is unfortunately indigestible, especially for novice Perl programmers. There must be something more elegant. – mabalenk Jun 16 '21 at 08:26
  • It's a messy problem. If you think of something that's better, though, post it! – brian d foy Jun 16 '21 at 15:17
8

Assuming you aren't using the regex to do extraction, and just matching...

[A-Z][a-zA-Z]*

Isn't the only real requirement that it's all letters and starts with a capital letter?

j_random_hacker
Bill Lynch
  • This is pretty much equivalent to Brian's regex except less complicated. You could detect words like HellotheRe, which obviously isn't correct CamelCase, but no regex can tell what is a word in there. Just put in the boundary marks and this should be good enough. – Unknown May 03 '09 at 01:54
  • @BillLynch What do I need to do if extraction is required? I came up with the following: `perl -pi -e 's/mx(?=[A-Z]*[a-zA-Z]*)/mtrx/g'`. My camel case words start with `mx`, which I would like to rename to `mtrx`. I noticed that the word boundary assertion `\b` doesn't work here. – mabalenk Jun 16 '21 at 08:50
5

brian's and sharth's answers will also report words that consist entirely of uppercase letters (e.g. FOO). This may or may not be what you want. If you want to restrict to just camel-cased words that contain at least one lowercase letter, use:

/\b[A-Z][a-zA-Z]*[a-z][a-zA-Z]*\b/

If in addition you wish to exclude words that consist of a single uppercase letter followed by any number of lowercase letters (e.g. Perl), use:

/\b[A-Z][a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/

(Basically, we require the string to start with a capital letter and to contain at least one additional capital letter and one lowercase letter; these latter two can appear in either order.)
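A minimal sketch comparing the two patterns side by side; the variable names `$has_lower` and `$compound` and the sample word list are only illustrative assumptions, with "FOO" and "Perl" taken from the examples above.

```perl
use strict;
use warnings;

# First pattern: capitalized word containing at least one lowercase letter.
my $has_lower = qr/\b[A-Z][a-zA-Z]*[a-z][a-zA-Z]*\b/;

# Second pattern: additionally requires a second capital letter.
my $compound = qr/\b[A-Z][a-zA-Z]*(?:[a-z][a-zA-Z]*[A-Z]|[A-Z][a-zA-Z]*[a-z])[a-zA-Z]*\b/;

my @samples = qw(ZshTabCompletionBackward MacTerminalIterm FOO Perl);
my %result;
for my $word (@samples) {
    # Store a 1/0 flag for each pattern, in the order (has_lower, compound).
    $result{$word} = [ $word =~ $has_lower ? 1 : 0, $word =~ $compound ? 1 : 0 ];
    printf "%-26s has_lower=%d compound=%d\n", $word, @{ $result{$word} };
}
```

With these samples, "FOO" is rejected by both patterns, "Perl" passes only the first, and the two compound words pass both.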

Community
j_random_hacker
  • Your first example matches things that aren't compound words, like "Foo". The second one is a bit hairy for early morning golfing. :) – brian d foy May 03 '09 at 12:07
  • @brian: As you know, with regexes it's often a case of "some hair required." :) I hope it's clear from the 2nd body text paragraph that the 1st regex will match "Foo" et al. (since the purpose of the 2nd regex is specifically to exclude those matches). – j_random_hacker May 03 '09 at 14:28
0

Use this one:

/\b[A-Z]([a-z]+[A-Z]?)*\b/
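
As a quick check of what this pattern accepts, here is a small sketch; the sample words and the `\A`...`\z` anchoring are assumptions added for the test, not part of the answer.

```perl
use strict;
use warnings;

my $re = qr/\b[A-Z]([a-z]+[A-Z]?)*\b/;

my %matched;
for my $word (qw(ZshTabCompletionBackward MacTerminalIterm Perl FOO)) {
    # Anchor with \A and \z so that only a match spanning the whole word counts.
    $matched{$word} = $word =~ /\A$re\z/ ? 1 : 0;
    print "$word: ", ($matched{$word} ? "match" : "no match"), "\n";
}
```

Note that it rejects all-caps words such as "FOO" but still accepts a single capitalized word such as "Perl", so it only fits if one-hump words should count.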
Peter Mortensen
Jagmal