I have the following two strings in Python (short_sentence is contained in long_sentence):

short_sentence = '&lt;p data-reactid=&quot;389&quot;&gt;THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.&lt;/p&gt;'

long_sentence = '<description>&lt;img src=&quot;http://cdn.static-economist.com/sites/default/files/images/print-edition/20170211_LDC811.png&quot; alt=&quot;&quot; title=&quot;&quot; height=&quot;376&quot; width=&quot;458&quot; class=&quot; blog-post-article-image blog-post-article-image__slim&quot; data-reactid=&quot;388&quot;/&gt;&lt;p data-reactid=&quot;389&quot;&gt;THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.&lt;/p&gt;&lt;p data-reactid=&quot;390&quot;&gt;To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.&lt;/p&gt;'

I'd like to extract each of the shortest possible substrings that sit between an opening &lt;p + anything + &gt; and a closing &lt;/p&gt;. I know that short_sentence contains one such occurrence:

THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.

In long_sentence, there is the one above and another one:

To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.

I understand that Python's re.findall() returns all non-overlapping matches of a pattern in a string. When I execute the following:

re.findall("&lt;p.*&gt;(.*?)&lt;/p&gt;", short_sentence)

I get the expected result:

['THE prospect of deregulation helps explain why, since Donald Trump\xe2\x80\x99s election, no bit of the American stockmarket has done better than financial firms. On February 3rd their shares climbed again as Mr Trump signed an executive order asking the Treasury to conduct a 120-day review of America\xe2\x80\x99s financial regulations, including the Dodd-Frank act put in place after the financial crisis of 2007-08, to assess whether these rules meet a set of \xe2\x80\x9ccore principles\xe2\x80\x9d.']

However, when I try to extract the two substrings from long_sentence with the following:

re.findall("&lt;p.*&gt;(.*?)&lt;/p&gt;", long_sentence)

I still get only one occurrence (the second one):

['To critics of Dodd-Frank, this is thrilling stuff. They see the law as a piece of statist overreach that throttles the American economy. Plenty in the Trump administration would love to gut it. The president himself has called it a \xe2\x80\x9cdisaster\xe2\x80\x9d. Gary Cohn, until recently one of the leaders of Goldman Sachs, a big bank, and now Mr Trump\xe2\x80\x99s chief economic adviser, promises to \xe2\x80\x9cattack all aspects of Dodd-Frank\xe2\x80\x9d.']

My question is: what goes wrong in the second case? Why doesn't it return both occurrences?

Hendrik

1 Answer

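The .* in &lt;p.*&gt; is greedy, so it consumes as much as it can: it runs past the first paragraph and stops only at the last &gt; that still lets the rest of the pattern match. That is why only the second paragraph is captured. If you use the lazy form &lt;p.*?&gt; instead, you will get the expected result.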
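To make the difference concrete, here is a minimal sketch on a shortened, made-up stand-in for long_sentence (the name sample and its paragraph texts are my own assumptions, not the asker's data). It only illustrates greedy versus lazy matching and is not meant as a robust HTML parser.

```python
import re

# Hypothetical, shortened stand-in for long_sentence (same escaped markup).
sample = (
    "&lt;img src=&quot;x.png&quot;/&gt;"
    "&lt;p data-reactid=&quot;389&quot;&gt;First paragraph.&lt;/p&gt;"
    "&lt;p data-reactid=&quot;390&quot;&gt;Second paragraph.&lt;/p&gt;"
)

# Greedy: .* runs to the end of the string and backtracks only as far as
# needed, so the opening-tag part swallows the whole first paragraph.
greedy = re.findall(r"&lt;p.*&gt;(.*?)&lt;/p&gt;", sample)

# Lazy: .*? stops at the first &gt; that lets the rest of the pattern match.
lazy = re.findall(r"&lt;p.*?&gt;(.*?)&lt;/p&gt;", sample)

print(greedy)  # ['Second paragraph.']
print(lazy)    # ['First paragraph.', 'Second paragraph.']
```

With the greedy pattern there is only one match because the single match already spans from the first opening tag to the last closing tag, leaving nothing for findall to scan afterwards.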

There is more information on this topic here, if you need it: http://www.regular-expressions.info/repeat.html

An extract:

Suppose you want to use a regex to match an HTML tag. You know that the input will be a valid HTML file, so the regular expression does not need to exclude any invalid use of sharp brackets. If it sits between sharp brackets, it is an HTML tag.

Most people new to regular expressions will attempt to use <.+>. They will be surprised when they test it on a string like This is a <EM>first</EM> test. You might expect the regex to match <EM> and when continuing after that match, </EM>.
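The behaviour described in the extract can be checked directly in Python. This is a small sketch; the sample string with an emphasis tag around "first" is an assumption on my part, chosen to mirror the extract.

```python
import re

# Assumed sample string: one word wrapped in an emphasis tag.
text = "This is a <EM>first</EM> test"

# Greedy .+ grabs everything from the first < to the last >.
greedy = re.findall(r"<.+>", text)

# Lazy .+? stops at the first > it can, so each tag is matched separately.
lazy = re.findall(r"<.+?>", text)

print(greedy)  # ['<EM>first</EM>']
print(lazy)    # ['<EM>', '</EM>']
```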

dquijada