Get consecutive capitalized words using regex

Question

I am having trouble with my regex for capturing consecutive capitalized words. Here is what I want the regex to capture:

"said Polly Pocket and the toys" -> Polly Pocket

Here is the regex I am using:

re.findall('said ([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

It returns the following:

[('Polly Pocket', ' Pocket')]

I want it to return:

['Polly Pocket']

So what if the input was `i Have A String and It Is Long`? Should it give `['Have A String', 'It Is Long']` or `['Have A String and It Is Long']` — Adam Parkin, Mar 01 '12 at 23:49
Why do you have the word "said" in your findall? Do you actually intend to only find consecutive capital words following "said "? — jgritty, Mar 01 '12 at 23:53

score 31 · Accepted Answer · answered Mar 01 '12 at 23:49

Use a positive look-ahead:

([A-Z][a-z]+(?=\s[A-Z])(?:\s[A-Z][a-z]+)+)

The look-ahead asserts that a word is accepted only if it is followed by whitespace and another word that starts with a capital letter. Broken down:

(                  # begin capture
  [A-Z]            # one uppercase letter   \ First word
  [a-z]+           # 1+ lowercase letters   /
  (?=\s[A-Z])      # must be followed by a whitespace character and an uppercase letter
  (?:              # non-capturing group
    \s             # one whitespace character
    [A-Z]          # one uppercase letter   \ Additional word(s)
    [a-z]+         # 1+ lowercase letters   /
  )+               # group can repeat (more words)
)                  # end capture
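
For anyone who wants to keep the breakdown next to the pattern, here is a minimal sketch using Python's `re.VERBOSE` flag, which lets the regex carry its own comments. The names `pattern`, `polly`, and `mixed` are just illustrative, and the second input is the one from the comment under the question.

```python
import re

# Same pattern as in the answer, written with re.VERBOSE so that whitespace
# and "#" comments inside the pattern are ignored by the regex engine.
pattern = re.compile(r"""
    (                      # begin capture
      [A-Z][a-z]+          # first word: one uppercase, then 1+ lowercase letters
      (?=\s[A-Z])          # look-ahead: whitespace and an uppercase letter follow
      (?:\s[A-Z][a-z]+)+   # one or more additional capitalized words
    )                      # end capture
""", re.VERBOSE)

polly = pattern.findall("said Polly Pocket and the toys")
mixed = pattern.findall("i Have A String and It Is Long")

print(polly)  # ['Polly Pocket']
print(mixed)  # ['It Is Long']
```

Note that `[a-z]+` requires at least one lowercase letter after the capital, so a one-letter word such as `A` ends a run. That is why `Have A String` is not picked up in the second call.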

This still gives `['Polly Pocket', ' Pocket']` when I run it. — Adam Parkin, Mar 01 '12 at 23:51
@Adam: Had to do with the internal group also capturing. Run what I have now, post the breakdown addition. — Brad Christie, Mar 01 '12 at 23:53

score 7 · Answer 2 · edited Mar 02 '12 at 00:02

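It's because when a pattern contains capturing groups, findall returns the groups instead of the whole match (one tuple per match when there is more than one group), and you have two capturing groups: the outer one that gets all the matching text, and the inner one for subsequent words.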

You can just make your second capturing group into a non-capturing one by using (?:regex) instead of (regex):

re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)
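
A small side-by-side sketch of the difference, using the sentence from the question. The variable names are made up for illustration, and the pattern is the one above without the leading `said `.

```python
import re

article = "said Polly Pocket and the toys"

# Two capturing groups: findall returns one tuple per match, with one
# entry per group. The inner group keeps only its last repetition.
with_inner_capture = re.findall(r'([A-Z][\w-]*(\s+[A-Z][\w-]*)+)', article)

# Inner group made non-capturing with (?:...): only one capturing group
# remains, so findall returns plain strings.
without_inner_capture = re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', article)

print(with_inner_capture)     # [('Polly Pocket', ' Pocket')]
print(without_inner_capture)  # ['Polly Pocket']
```

With only one group left, the outer parentheses are optional too: a pattern with no capturing groups makes findall return the whole match.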

edited Mar 02 '12 at 00:02

answered Mar 01 '12 at 23:49

mathematical.coffee


I don't think 'said' was intended to be part of the regex, i.e. `he likes Polly Pocket` should return the same matches. – Adam Parkin Mar 01 '12 at 23:53
oh apologies, I blindly copied from OP. – mathematical.coffee Mar 02 '12 at 00:02

score 5 · Answer 3 · edited Sep 19 '13 at 00:28

$mystring = "the United States of America has many big cities like New York and Los Angeles, and others like Atlanta";

@phrases = $mystring =~ /[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*/g;

print "\n" . join(", ", @phrases) . "\n\n# phrases = " . scalar(@phrases) . "\n\n";

OUTPUT:

$ ./try_me.pl

United States, America, New York, Los Angeles, Atlanta

# phrases = 5
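
Since the question is about Python, here is a rough port of the Perl snippet above, offered as a sketch rather than part of the original answer. The names `my_string` and `phrases` are only illustrative. The final `*` on the group means single capitalized words (`America`, `Atlanta`) are matched as well as runs of two or more.

```python
import re

my_string = ("the United States of America has many big cities like "
             "New York and Los Angeles, and others like Atlanta")

# No capturing groups, so findall returns each whole match as a string.
# The trailing * lets a lone capitalized word count as a phrase.
phrases = re.findall(r"[A-Z][\w'-]*(?:\s+[A-Z][\w'-]*)*", my_string)

print(", ".join(phrases))          # United States, America, New York, Los Angeles, Atlanta
print("# phrases =", len(phrases))  # # phrases = 5
```

Changing the final `*` to `+` would restrict the result to runs of at least two capitalized words, which is what the question asks for.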

edited Sep 19 '13 at 00:28

Josh Crozier


answered Sep 19 '13 at 00:10

Shibamouli Lahiri
