
I have the following string:

s = '''
    <a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
    <a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''

I am trying to capture the text inside both <span> elements, using the following:

regex = re.compile('<a class="biz-name[\w\W]*<span>(.*)</span>')
regex.findall(s)

expected:

['Gus’s World Famous Fried Chicken', 'South City Kitchen - Midtown']

actual

['South City Kitchen - Midtown']

Why is only the last occurrence being matched?

cosmosa

2 Answers


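You shouldn't parse HTML (or XML) with regex. That said, the greediness of the regex got you: [\w\W]* matches any character, including newlines, so it consumes everything up to the last <span> and swallows the first link along the way.

Adding the non-greedy ? modifier ([\w\W]*?) fixes that, and it doesn't hurt to add one inside the group as well. I have also replaced [\w\W]*? with .*?, which is simpler and works here because each link sits on a single line. The two are not strictly equivalent: unlike [\w\W], . does not match newlines unless re.DOTALL is set.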

regex = re.compile('<a class="biz-name.*?<span>(.*?)</span>')

See this on regex101.
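
To make the difference visible, here is a small sketch that runs both patterns on the string from the question. The variable names (greedy, lazy, and so on) are only illustrative; the patterns are the ones discussed above, written as raw strings.

```python
import re

s = '''
    <a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
    <a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''

# Pattern from the question (greedy) and the fixed version (lazy).
greedy = re.compile(r'<a class="biz-name[\w\W]*<span>(.*)</span>')
lazy = re.compile(r'<a class="biz-name.*?<span>(.*?)</span>')

# [\w\W]* also matches newlines, so it first runs to the end of the string
# and then backtracks only as far as the LAST "<span>". The whole input
# therefore ends up inside a single match.
greedy_match = greedy.search(s)
links_in_one_match = greedy_match.group(0).count('<a class')

greedy_results = greedy.findall(s)
lazy_results = lazy.findall(s)

print(links_in_one_match)  # how many links the single greedy match covers
print(greedy_results)
print(lazy_results)
```

The greedy match starts at the first link and ends at the second </span>, which is why findall has nothing left to report for the first one.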

Joe Iddon
Jean-François Fabre

Regex is rarely the best way to scrape HTML. An alternative is to use BeautifulSoup:

from bs4 import BeautifulSoup
s = '''
<a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
<a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''
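# 'lxml' needs the lxml package; the built-in 'html.parser' also works here
soup = BeautifulSoup(s, 'lxml')
# Collect the text of every <span> in the document
results = [span.text for span in soup.find_all('span')]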

Output:

[u'Gus’s World Famous Fried Chicken', u'South City Kitchen - Midtown']

However, here is a simple regex solution:

import re
s = '''
 <a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
 <a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''
final_results = re.findall('<span>(.*?)</span>', s)

Output:

['Gus’s World Famous Fried Chicken', 'South City Kitchen - Midtown']
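
One caveat with the short pattern is that it captures every <span>, not only those inside biz-name links. The sketch below uses hypothetical markup (an extra span outside the link, and a span on its own line, neither of which appears in the question) to show one way to scope the match, assuming re.DOTALL is acceptable for the input.

```python
import re

# Hypothetical markup, not from the question: a span outside any biz-name
# link, and a link whose span starts on the next line.
html = '''
<span>Sponsored</span>
<a class="biz-name">
    <span>South City Kitchen - Midtown</span></a>
'''

# The short pattern picks up every span.
any_span = re.findall(r'<span>(.*?)</span>', html)

# Anchoring on the link restricts the match, but without re.DOTALL the
# lazy .*? cannot cross the line break between <a ...> and <span>.
scoped_no_dotall = re.findall(r'<a class="biz-name".*?<span>(.*?)</span>', html)

# With re.DOTALL, . also matches newlines; the lazy quantifiers keep each
# match as short as possible.
scoped = re.findall(r'<a class="biz-name".*?<span>(.*?)</span>', html, re.DOTALL)

print(any_span)
print(scoped_no_dotall)
print(scoped)
```

Even the scoped pattern can misfire on real pages, for example when a link has no span and the lazy .*? runs on into the next element, which is one more reason a parser is the safer tool.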
Ajax1234
  • why the downvote? – Ajax1234 Feb 04 '18 at 20:37
  • Thanks for the input, and I agree it might be a better approach but it's irrelevant to the question what the approach is. The goal is to use regex. – cosmosa Feb 04 '18 at 20:37
  • @cosmosa, using regex on HTML data is the wrong goal – RomanPerekhrest Feb 04 '18 at 20:38
  • While suggesting a better way to extract that text is perfectly fine, this actually does not answer the question: *“Why is only the last occurrence being matched?”* – poke Feb 04 '18 at 20:39
  • @cosmosa Please see my recent edit. I added a regex solution. – Ajax1234 Feb 04 '18 at 20:40
  • I like the beautifulsoup approach regardless of the fact that it's a regex question, to solve an html problem, so it's basically not ideal. – Jean-François Fabre Feb 04 '18 at 20:43
  • @poke Answers are not required to adhere to bad design restrictions posed by questions. Stating that they're bad and offering the better alternative is the *preferred* way to answer. As a 173k user, you should know better than to promote bad code. – jpmc26 Feb 04 '18 at 20:51
  • @jpmc26 Yes, we should show the OP the best way to perform their task. However, when the OP has asked why their current code doesn't behave as expected we _also_ need to answer that question! IMHO, it's important to correct misconceptions about language features, especially when that is the main thrust of the OP's question. – PM 2Ring Feb 08 '18 at 14:25
  • @PM2Ring And that question is clearly a duplicate which needs to not be answered at all. – jpmc26 Feb 08 '18 at 18:41