Need regex to take only the first two sentences even if other instances occur

Question

I need help with a regex that matches one of two opening phrases at the start of the text and then takes only the first two sentences, no matter how many other instances occur in the text.

text = "The Smithsonian museum is home to a variety of different art displays.  According various reports art appreciation is on the rise.  Blah blah blah blah.  The Smithsonian museum blah blah blah.  Blah blah blah blah."

My code looks something like this:

(re.findall(r"""((The Smithsonian|The Metropolitan)[^\.]*\.[^\.]*\.)""", text))

However, this is returning multiple instances instead of just the first two sentences, and oftentimes it returns junk like "The Smithsonian, " at the end. Can you please help? Thanks!

Perhaps use an anchor? `^(The Smithsonian|The Metropolitan)[^.]*\.[^.]*\.` — 4castle, Jul 14 '16 at 20:04
Do you need to take into account words like "Mr." or "Mrs."? — Erutan409, Jul 14 '16 at 20:29

score 0 · Answer 1 · answered Jul 14 '16 at 20:24


Try this:

^(The Smithsonian|The Metropolitan).+?(?>\.).+?(?>\.)
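A caveat for Python users: the atomic-group syntax `(?>...)` is only accepted by the standard `re` module from Python 3.11 onward. An atomic group around a single literal dot behaves the same as a plain `\.`, so a minimal sketch of the same idea, assuming the sample text from the question, might look like this:

```python
import re

text = ("The Smithsonian museum is home to a variety of different art displays.  "
        "According various reports art appreciation is on the rise.  "
        "Blah blah blah blah.  The Smithsonian museum blah blah blah.  "
        "Blah blah blah blah.")

# Same shape as the pattern above, with each (?>\.) written as a plain \.
pattern = r"^(The Smithsonian|The Metropolitan).+?\..+?\."

# re.match only tries to match at the start of the string,
# so at most one result comes back.
match = re.match(pattern, text)
first_two = match.group(0) if match else None
print(first_two)
```

Note that `.` does not cross newlines by default, so this sketch assumes both sentences sit on one line.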


user3597719


Federico Piazza · Answer 2 · 2016-07-14T20:36:52.023


I'm not a Python dev, but the problem seems to be that you are using findall, which returns every match. As far as I know, you can use finditer (and take only the first iteration) or search, which returns just a single match object.

However, if you want to use findall, then you can add the ^ anchor to your regex:

^((The Smithsonian|The Metropolitan)[^\.]*\.[^\.]*\.)
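As a rough illustration of the difference between the three functions, here is a small sketch that uses the pattern and sample text from the question. The variable names are only for illustration.

```python
import re

text = ("The Smithsonian museum is home to a variety of different art displays.  "
        "According various reports art appreciation is on the rise.  "
        "Blah blah blah blah.  The Smithsonian museum blah blah blah.  "
        "Blah blah blah blah.")

pattern = r"((The Smithsonian|The Metropolitan)[^.]*\.[^.]*\.)"

# findall scans the whole string and returns every non-overlapping match.
match_count = len(re.findall(pattern, text))

# search stops at the first match and returns one match object (or None).
first = re.search(pattern, text)
first_two_sentences = first.group(0) if first else None

# finditer is lazy; next() pulls only the first match object from it.
first_via_iter = next(re.finditer(pattern, text), None)

print(match_count)
print(first_two_sentences)
```

Without the `^` anchor, `search` returns the first occurrence anywhere in the text, which may or may not be what is wanted if the text can start with something else.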


edited Jul 14 '16 at 20:36

answered Jul 14 '16 at 20:31

Federico Piazza


score 0 · Answer 3 · edited Jul 14 '16 at 21:33


With this regex, you don't have to hard code any beginning phrase for the sentences. It will match exactly 2 occurrences of a sentence followed by the spaces before the next sentence.

^((?:\w+(?:\s|\.))+\s+){2}

Here is the testing link for it: https://regex101.com/r/mJ4oR7/2

This assumes the string contains only word characters, whitespace, and periods (no commas or other punctuation).
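To try this pattern from Python, something along these lines should work. It is only a sketch that assumes the sample text from the question. The second call shows the limitation mentioned above, using a made-up sentence that contains a comma.

```python
import re

text = ("The Smithsonian museum is home to a variety of different art displays.  "
        "According various reports art appreciation is on the rise.  "
        "Blah blah blah blah.  The Smithsonian museum blah blah blah.  "
        "Blah blah blah blah.")

# Two repetitions of: a run of words, each followed by whitespace or a period,
# then the whitespace that separates it from the next sentence.
pattern = r"^((?:\w+(?:\s|\.))+\s+){2}"

match = re.match(pattern, text)
# The match includes the spaces after the second sentence, so strip them.
first_two = match.group(0).rstrip() if match else None
print(first_two)

# A comma is neither \w, \s, nor a period, so this text does not match at all.
no_match = re.match(pattern, "Hello, world.  Second sentence.  Third one.")
print(no_match)
```

Because the pattern requires whitespace after each sentence, it would also fail on a text that ends right after the second sentence with no trailing space.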

edited Jul 14 '16 at 21:33

Thomas


answered Jul 14 '16 at 20:39

m_callens


This looks useful; however, I need to take only the sentences with those qualifiers in them. How would I implement this same regex with the two opening qualifiers? I tried doing it myself but am having a bit of trouble. Thanks! – staten12 Jul 15 '16 at 14:54

score 0 · Answer 4 · edited May 23 '17 at 12:14

If you don't want "The Smithsonian" etc. returned as a separate captured item in the result, make the inner group non-capturing with (?:...):

((?:The Smithsonian|The Metropolitan)[^\.]*\.[^\.]*\.)

Now findall returns a list of plain strings, so the first element holds just the first two sentences.

>>> import re
>>> text = "The Smithsonian museum is home to a variety of different art displays.  According various reports art appreciation is on the rise.  Blah blah blah blah.  The Smithsonian museum blah blah blah.  Blah blah blah blah."
>>> y = re.findall(r"""((?:The Smithsonian|The Metropolitan)[^\.]*\.[^\.]*\.)""", text)
>>> y[0]
'The Smithsonian museum is home to a variety of different art displays.  According various reports art appreciation is on the rise.'

See also: "What is a non-capturing group? What does a question mark followed by a colon (?:) mean?"
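To see why the stray "The Smithsonian" shows up in the original output, it may help to compare what findall returns for the two patterns. This is a small sketch using the sample text from the question:

```python
import re

text = ("The Smithsonian museum is home to a variety of different art displays.  "
        "According various reports art appreciation is on the rise.  "
        "Blah blah blah blah.  The Smithsonian museum blah blah blah.  "
        "Blah blah blah blah.")

capturing = r"((The Smithsonian|The Metropolitan)[^.]*\.[^.]*\.)"
non_capturing = r"((?:The Smithsonian|The Metropolitan)[^.]*\.[^.]*\.)"

# Two capturing groups: findall yields one (outer, inner) tuple per match,
# and the inner element is the bare opening phrase.
with_tuples = re.findall(capturing, text)

# One capturing group: findall yields one plain string per match.
plain = re.findall(non_capturing, text)

print(with_tuples[0][1])
print(plain[0])
```

Both calls still find every occurrence in the text; indexing with `[0]` is what limits the result to the first one.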
