Why is only the last occurrence being matched?

Asked Feb 04 '18 at 20:22

Active Feb 04 '18 at 20:39

I have the following string:

s = '''
    <a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
    <a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''

I am trying to match the text inside both <span> elements using the following:

import re
regex = re.compile(r'<a class="biz-name[\w\W]*<span>(.*)</span>')
regex.findall(s)

expected:

['Gus’s World Famous Fried Chicken', 'South City Kitchen - Midtown']

actual:

['South City Kitchen - Midtown']

Why is only the last occurrence being matched?

asked Feb 04 '18 at 20:22

cosmosa

2 Answers

You shouldn't parse XML with regex. That said, the greediness of the regex got you: [\w\W]* matches any character at all, newlines included, so it swallows everything up to the last <span> and the first occurrence ends up inside that single match.
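A minimal sketch of what the greedy pattern actually consumes, reusing the question's string and pattern (the variable names `greedy` and `m` are mine, not from the question). It suggests there is only one match, and that it starts at the first link and runs through the second:

```python
import re

s = '''
    <a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
    <a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''

# The question's pattern: [\w\W]* is greedy and also matches newlines.
greedy = re.compile(r'<a class="biz-name[\w\W]*<span>(.*)</span>')

# search() returns the first (and here the only) match.
m = greedy.search(s)

# The whole match begins at the first <a ...> and ends on the second line,
# so the first business name is inside the match but not inside the group.
print(repr(m.group(0)))
print(m.group(1))          # South City Kitchen - Midtown
print(greedy.findall(s))   # ['South City Kitchen - Midtown']
```

Because [\w\W]* first runs to the end of the string and then backtracks only until a <span> is found, the capture group lands on the last <span>, not the first.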

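Adding a non-greedy ? token ([\w\W]*?) fixes that, and it doesn't hurt to add one inside the group as well. I have replaced [\w\W]*? with .*? because it's simpler. The two give the same result here, since each <a> and its <span> sit on one line, but note that . does not match newlines unless re.DOTALL is set.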

regex = re.compile('<a class="biz-name.*?<span>(.*?)</span>')

See this on regex101.
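A small check of the non-greedy pattern, and of the one case where .*? and [\w\W]*? are not interchangeable. The test strings `one_line` and `multi_line` are made up for illustration; `multi_line` assumes the <span> may sit on a different line from its <a> tag:

```python
import re

# Hypothetical inputs, not from the question.
one_line = ('<a class="biz-name"><span>A</span></a>'
            '<a class="biz-name"><span>B</span></a>')
multi_line = ('<a class="biz-name">\n<span>A</span></a>\n'
              '<a class="biz-name">\n<span>B</span></a>')

dot_lazy = re.compile(r'<a class="biz-name.*?<span>(.*?)</span>')
any_lazy = re.compile(r'<a class="biz-name[\w\W]*?<span>(.*?)</span>')
dot_all = re.compile(r'<a class="biz-name.*?<span>(.*?)</span>', re.DOTALL)

# On a single line the lazy patterns agree.
print(dot_lazy.findall(one_line))    # ['A', 'B']
print(any_lazy.findall(one_line))    # ['A', 'B']

# Across a newline, plain "." stops matching; [\w\W] or re.DOTALL does not.
print(dot_lazy.findall(multi_line))  # []
print(any_lazy.findall(multi_line))  # ['A', 'B']
print(dot_all.findall(multi_line))   # ['A', 'B']
```

For the question's input, where every link is on its own line, either spelling works.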

edited Feb 04 '18 at 20:35

Joe Iddon

answered Feb 04 '18 at 20:30

Jean-François Fabre

The code works. What's wrong with the comment? – Jean-François Fabre Feb 04 '18 at 20:32
@RomanPerekhrest first sentence of my answer: you shouldn't parse xml with regex. – Jean-François Fabre Feb 04 '18 at 20:34

Regex is rarely the best way to scrape HTML. For instance, an alternative would be to use BeautifulSoup:

from bs4 import BeautifulSoup
s = '''
<a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
<a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''
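# Parse the markup (the 'lxml' parser requires the lxml package to be installed)
soup = BeautifulSoup(s, 'lxml')
# Collect the text of every <span> element
results = [span.text for span in soup.find_all('span')]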

Output:

[u'Gus’s World Famous Fried Chicken', u'South City Kitchen - Midtown']
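If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a swapped-in technique, not what the code above uses. The class name `BizNameParser` and its attributes are hypothetical, and the sketch assumes Python 3 and that each name sits in a <span> inside an <a class="biz-name">:

```python
from html.parser import HTMLParser

s = '''
<a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
<a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''


class BizNameParser(HTMLParser):
    """Collect the text of each <span> nested in an <a class="biz-name">."""

    def __init__(self):
        super().__init__()
        self.in_biz_link = False  # inside <a class="biz-name"> ... </a>
        self.in_span = False      # inside a <span> within such a link
        self.names = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and 'biz-name' in classes:
            self.in_biz_link = True
        elif tag == 'span' and self.in_biz_link:
            self.in_span = True
            self.names.append('')  # start a new, empty name

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_span = False
        elif tag == 'a':
            self.in_biz_link = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so append to the current name.
        if self.in_span:
            self.names[-1] += data


parser = BizNameParser()
parser.feed(s)
parser.close()
print(parser.names)
```

Unlike find_all('span'), this sketch only keeps spans that are inside a biz-name link, which is closer to what the question's regex was aiming for.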

However, here is a simple regex solution:

import re
s = '''
 <a class="biz-name"><span>Gus’s World Famous Fried Chicken</span></a>
 <a class="biz-name"><span>South City Kitchen - Midtown</span></a>
'''
final_results = re.findall('<span>(.*?)</span>', s)

Output:

['Gus’s World Famous Fried Chicken', 'South City Kitchen - Midtown']

edited Feb 04 '18 at 20:39

answered Feb 04 '18 at 20:37

Ajax1234

why the downvote? – Ajax1234 Feb 04 '18 at 20:37
Thanks for the input, and I agree it might be a better approach but it's irrelevant to the question what the approach is. The goal is to use regex. – cosmosa Feb 04 '18 at 20:37
@cosmosa, using regex on HTML data is the wrong goal – RomanPerekhrest Feb 04 '18 at 20:38
While suggesting a better way to extract that text is perfectly fine, this actually does not answer the question: *“Why is only the last occurrence being matched?”* – poke Feb 04 '18 at 20:39
@cosmosa Please see my recent edit. I added a regex solution. – Ajax1234 Feb 04 '18 at 20:40
I like the beautifulsoup approach regardless of the fact that it's a regex question, to solve an html problem, so it's basically not ideal. – Jean-François Fabre Feb 04 '18 at 20:43
@poke Answers are not required to adhere to bad design restrictions posed by questions. Stating that they're bad and offering the better alternative is the *preferred* way to answer. As a 173k user, you should know better than to promote bad code. – jpmc26 Feb 04 '18 at 20:51
@jpmc26 Yes, we should show the OP the best way to perform their task. However, when the OP has asked why their current code doesn't behave as expected we _also_ need to answer that question! IMHO, it's important to correct misconceptions about language features, especially when that is the main thrust of the OP's question. – PM 2Ring Feb 08 '18 at 14:25
@PM2Ring And that question is clearly a duplicate which needs to not be answered at all. – jpmc26 Feb 08 '18 at 18:41