12

I am having trouble with my regex for capturing consecutive capitalized words. Here is what I want the regex to capture:

"said Polly Pocket and the toys" -> Polly Pocket

Here is the regex I am using:

re.findall(r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

It returns the following:

[('Polly Pocket', ' Pocket')]

I want it to return:

['Polly Pocket']
egidra
  • So what if the input was `i Have A String and It Is Long`? Should it give `['Have A String', 'It Is Long']` or `['Have A String and It Is Long']` – Adam Parkin Mar 01 '12 at 23:49
  • Why do you have the word "said" in your findall? Do you actually intend to only find consecutive capital words following "said "? – jgritty Mar 01 '12 at 23:53

3 Answers

31

Use a positive look-ahead:

([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)

The look-ahead asserts that a word is accepted only if it is followed by another word that starts with a capital letter. Broken down:

(                    # begin capture
  [A-Z]              # one uppercase letter  \ First word
  [a-z]+             # 1+ lowercase letters  /
  (?=\s[A-Z])        # must be followed by whitespace and an uppercase letter
  (?:                # non-capturing group
    \s               # whitespace
    [A-Z]            # one uppercase letter  \ Additional word(s)
    [a-z]+           # 1+ lowercase letters  /
  )+                 # group can repeat (more words)
)                    # end capture
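As a rough sketch of how the breakdown above could be used from Python: the same pattern can be compiled with `re.VERBOSE`, which lets the comments stay inline. The names `PATTERN`, `article`, and `matches` are only illustrative, and the sample sentence is the one from the question.

```python
import re

# Same pattern as above; re.VERBOSE ignores the layout whitespace and comments.
PATTERN = re.compile(r"""
    (                     # begin capture
      [A-Z][a-z]+         # first word
      (?=\s[A-Z])         # must be followed by whitespace and an uppercase letter
      (?:\s[A-Z][a-z]+)+  # one or more additional capitalized words
    )                     # end capture
""", re.VERBOSE)

article = "said Polly Pocket and the toys"  # sample sentence from the question

# Only one capturing group, so findall returns a list of plain strings.
matches = PATTERN.findall(article)
print(matches)  # ['Polly Pocket']
```

Because the pattern requires at least one additional word, a lone capitalized word such as "Atlanta" should not match on its own.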
Brad Christie
7

It's because findall returns all the capturing groups in your regex, and you have two capturing groups (one that gets all the matching text, and the inner one for subsequent words).

You can just make your second capturing group into a non-capturing one by using (?:regex) instead of (regex):

re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)
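To make the difference concrete, here is a small side-by-side sketch using the sentence from the question. It keeps the `said ` prefix from the original pattern; the variable names are just illustrative.

```python
import re

article = "said Polly Pocket and the toys"  # sample sentence from the question

# Two capturing groups: findall returns one tuple per match,
# and the inner group holds only its last repetition.
with_inner_group = re.findall(r'said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

# Inner group made non-capturing: findall returns plain strings.
without_inner_group = re.findall(r'said ([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)

print(with_inner_group)     # [('Polly Pocket', ' Pocket')]
print(without_inner_group)  # ['Polly Pocket']
```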
mathematical.coffee
5
$mystring = "the United States of America has many big cities like New York and Los Angeles, and others like Atlanta";

@phrases = $mystring =~ /[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*/g;

print "\n" . join(", ", @phrases) . "\n\n# phrases = " . scalar(@phrases) . "\n\n";

OUTPUT:

$ ./try_me.pl

United States, America, New York, Los Angeles, Atlanta

# phrases = 5
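Since the question is about Python, a rough Python equivalent of the Perl snippet above might look like this. It is a sketch with illustrative variable names (`text`, `phrases`), not a line-for-line port.

```python
import re

text = ("the United States of America has many big cities like "
        "New York and Los Angeles, and others like Atlanta")

# No capturing groups, so findall returns the whole match each time.
# The trailing * (rather than +) lets single capitalized words match too.
phrases = re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", text)

print(", ".join(phrases))
print("# phrases =", len(phrases))
```

Unlike the patterns in the other answers, this one also returns single capitalized words such as "America" and "Atlanta", because the group of additional words is optional.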
Josh Crozier