
I need help with a regex that finds the first two words at the start of the text and then takes only the first two sentences, regardless of how many times those words occur in the text.

text = "The Smithsonian museum is home to a variety of different art displays.  According various reports art appreciation is on the rise.  Blah blah blah blah.  The Smithsonian museum blah blah blah.  Blah blah blah blah."

My code looks something like this:

(re.findall(r"""((The Smithsonian|The Metropolitan)[^\.]*\.[^\.]*\.)""", text))

However, this is returning multiple instances instead of just the first two sentences, and oftentimes it returns junk like "The Smithsonian, " at the end. Can you please help? Thanks!

Thomas
staten12

4 Answers


Try this:

^(The Smithsonian|The Metropolitan).+?(?>\.).+?(?>\.)
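
A note for Python users: atomic groups such as (?>\.) are only accepted by the re module from Python 3.11 onward, and around a single literal dot they do not change what is matched. Below is a minimal sketch of the same idea with a plain \. instead, run against the text from the question. The variable names are only for illustration.

```python
import re

text = ("The Smithsonian museum is home to a variety of different art displays.  "
        "According various reports art appreciation is on the rise.  "
        "Blah blah blah blah.  The Smithsonian museum blah blah blah.  "
        "Blah blah blah blah.")

# Same shape as the pattern above, but with plain \. in place of (?>\.).
# Each lazy .+? stops at the nearest period, so the match ends after two sentences.
pattern = r"^(?:The Smithsonian|The Metropolitan).+?\..+?\."

# re.match only tries the start of the string, so later occurrences are ignored.
m = re.match(pattern, text)
first_two = m.group(0) if m else None
print(first_two)
```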

user3597719

I'm not a Python dev, but the problem seems to be that you are using findall. As far as I know, you can use finditer (and take the first match) or search to get just one match object.

However, if you want to use findall, then you can add the ^ anchor to your regex:

^((The Smithsonian|The Metropolitan)[^\.]*\.[^\.]*\.)

regex demo
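
As a rough sketch of the difference in Python, using the text from the question: re.search returns only the first match, while re.findall with two capturing groups returns one tuple per match, which is where the extra "The Smithsonian" entries come from. The variable names are illustrative.

```python
import re

text = ("The Smithsonian museum is home to a variety of different art displays.  "
        "According various reports art appreciation is on the rise.  "
        "Blah blah blah blah.  The Smithsonian museum blah blah blah.  "
        "Blah blah blah blah.")

# findall with two capturing groups: one (whole match, prefix) tuple per occurrence.
tuples = re.findall(r"((The Smithsonian|The Metropolitan)[^.]*\.[^.]*\.)", text)

# search stops at the first match; group(0) is the whole matched text.
pattern = r"(?:The Smithsonian|The Metropolitan)[^.]*\.[^.]*\."
m = re.search(pattern, text)
first = m.group(0) if m else None

# With the ^ anchor, findall can only match at the start, so at most one result.
anchored = re.findall("^" + pattern, text)

print(first)
```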

Federico Piazza

With this regex, you don't have to hard code any beginning phrase for the sentences. It will match exactly 2 occurrences of a sentence followed by the spaces before the next sentence.

^((?:\w+(?:\s|\.))+\s+){2}

Here is the testing link for it: https://regex101.com/r/mJ4oR7/2

This is assuming there are no special characters within the string.
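
To try this in Python, here is a minimal sketch using re.match on the text from the question. The second pattern is an assumption and not part of the answer above: it adds a lookahead so the match only succeeds when the text starts with one of the two phrases. Note that both patterns expect whitespace after the second sentence, so the result is stripped.

```python
import re

text = ("The Smithsonian museum is home to a variety of different art displays.  "
        "According various reports art appreciation is on the rise.  "
        "Blah blah blah blah.  The Smithsonian museum blah blah blah.  "
        "Blah blah blah blah.")

# The pattern from the answer, with the outer group made non-capturing.
two_sentences = r"^(?:(?:\w+(?:\s|\.))+\s+){2}"

# Assumed variant: a lookahead requires one of the two phrases at the start.
with_prefix = r"^(?=The Smithsonian|The Metropolitan)(?:(?:\w+(?:\s|\.))+\s+){2}"

m = re.match(with_prefix, text)
# group(0) is the whole match; strip() drops the trailing spaces it includes.
result = m.group(0).strip() if m else None

# Text that does not start with either phrase is rejected by the lookahead.
no_match = re.match(with_prefix, "Blah blah.  Blah blah.  Blah.")

print(result)
```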

Thomas
m_callens
  • This looks useful; however, I only need the sentences that start with those qualifiers. How would I use this same pattern with the two beginning qualifiers? I tried doing it myself but am having a bit of trouble. Thanks! – staten12 Jul 15 '16 at 14:54

If you don't want "The Smithsonian" etc. returned as a separate group in the result, make the second group non-capturing with (?:):

((?:The Smithsonian|The Metropolitan)[^\.]*\.[^\.]*\.)

Now findall returns only the matched sentences as plain strings, and the first element is the one you want.

>>> import re
>>> text = "The Smithsonian museum is home to a variety of different art displays.  According various reports art appreciation is on the rise.  Blah blah blah blah.  The Smithsonian museum blah blah blah.  Blah blah blah blah."
>>> y = re.findall(r"""((?:The Smithsonian|The Metropolitan)[^\.]*\.[^\.]*\.)""", text)
>>> y[0]
'The Smithsonian museum is home to a variety of different art displays.  According various reports art appreciation is on the rise.'

See also What is a non-capturing group? What does a question mark followed by a colon (?:) mean?.

Community
xystum