I am aware that using regex to parse HTML is technically incorrect, but I found this out too far into the project. It's for some coursework where I have already stated that I am going to use regex, so it is too late to go back on that now.
I'm trying to make a Python program that takes an HTML document, strips out the numbers contained after the card-count class, and appends them to a list. The problem is that, rather than finding only the first match each time it runs, it seems to find the first one and all the others that are identical to it, and so on. Here is some example HTML and my regex:
<span class="card-count">1</span>
<span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BGarruk%5D+%5BRelentless%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&name=Garruk+Relentless" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Garruk Relentless</a></span>
</span>
<span class="row">
<span class="card-count">2</span>
<span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BJace,%5D+%5Bthe%5D+%5BMind%5D+%5BSculptor%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&name=Jace%2C+the+Mind+Sculptor" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Jace, the Mind Sculptor</a></span>
</span>
</div>
<div class="sorted-by-creature clearfix element">
<h5>Creature (16)</h5>
<span class="row">
<span class="card-count">4</span>
<span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BDeathrite%5D+%5BShaman%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&name=Deathrite+Shaman" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Deathrite Shaman</a></span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BNoble%5D+%5BHierarch%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&name=Noble+Hierarch" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Noble Hierarch</a></span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BStoneforge%5D+%5BMystic%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&name=Stoneforge+Mystic" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Stoneforge Mystic</a></span>
</span>
<span class="row">
<span class="card-count">4</span>
<span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BTrue-Name%5D+%5BNemesis%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&name=True-Name+Nemesis" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">True-Name Nemesis</a></span>
</span>
</div>
<div class="sorted-by-sorcery clearfix element">
<h5>Sorcery (3)</h5>
<span class="row">
<span class="card-count">3</span>
<span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BPonder%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&name=Ponder" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Ponder</a></span>
</span>
And the Python code is:
import re

# html is assumed to already hold the page source as a string
card_number_list = []
number_of_cards = 0

# find out how many of x cards there are in the deck
def card_number_regex(card_number_list):
    global number_of_cards
    global html
    number_in_set = re.search("card-count.*", html)
    get_rid = re.search("card-count.*", html).group(0)
    html = html.replace(get_rid, "")
    number_in_set = number_in_set.group(0)
    html = html.replace(number_in_set, "")
    number_in_set = number_in_set.replace('card-count">', "")
    number_in_set = number_in_set.replace('</span>', "")
    card_number_list.append(number_in_set)
    number_in_set_int = int(number_in_set)
    print(number_in_set_int)
    number_of_cards = number_of_cards + number_in_set_int
    return number_of_cards

while number_of_cards < 75:
    card_number_regex(card_number_list)
The output I get when I run this is 1 2 4 3, whereas the card counts in the example HTML above are 1 2 4 4 4 4 3.
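A likely explanation for the collapsed output is that str.replace removes every occurrence of the matched text by default, so all the identical "4" rows vanish after the first one is read. The sketch below is only an illustration under assumptions: the html string is a hypothetical shortened stand-in for the real page, and extract_card_counts is a made-up helper name, not part of the code above. It shows the replace behaviour and one possible way to collect every count in a single pass with re.findall.

```python
import re

# Hypothetical, shortened stand-in for the real page:
# three card-count rows, two of them identical.
html = (
    '<span class="card-count">2</span>\n'
    '<span class="card-count">4</span>\n'
    '<span class="card-count">4</span>\n'
)

four = 'card-count">4</span>'

# str.replace removes every occurrence by default,
# so both "4" rows disappear in one call and only the "2" row is left.
assert html.replace(four, "").count("card-count") == 1

# Passing a count of 1 as the third argument removes only the first occurrence.
assert html.replace(four, "", 1).count("card-count") == 2


def extract_card_counts(page):
    """Return the integer inside each card-count span, in document order."""
    # The capturing group (\d+) makes findall return just the digits.
    return [int(n) for n in re.findall(r'class="card-count">(\d+)</span>', page)]


card_number_list = extract_card_counts(html)
print(card_number_list)       # [2, 4, 4]
print(sum(card_number_list))  # 10
```

With this approach there is no need to delete matches from the page or to loop until a total is reached, since findall returns all non-overlapping matches at once.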