
I am aware that using regex to parse HTML is technically incorrect, but I found this out too far into the project. It's for some coursework in which I have already stated that I am going to use regex, so it is too late to go back on that now.

I'm trying to make a Python program that takes an HTML document, extracts the numbers contained in the card-count class, and appends them to a list. The problem is that rather than finding only the first match each time it runs, it seems to find the first one and all the others that are identical to it, and so on. Here is some example HTML and my regex:

              <span class="card-count">1</span>
          <span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BGarruk%5D+%5BRelentless%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&amp;name=Garruk+Relentless" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Garruk Relentless</a></span>
        </span>

                                                <span class="row">
          <span class="card-count">2</span>
          <span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BJace,%5D+%5Bthe%5D+%5BMind%5D+%5BSculptor%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&amp;name=Jace%2C+the+Mind+Sculptor" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Jace, the Mind Sculptor</a></span>
        </span>


  </div>


  <div class="sorted-by-creature clearfix element">


    <h5>Creature (16)</h5>

                                      <span class="row">
          <span class="card-count">4</span>
          <span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BDeathrite%5D+%5BShaman%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&amp;name=Deathrite+Shaman" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Deathrite Shaman</a></span>
        </span>

                                                <span class="row">
          <span class="card-count">4</span>
          <span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BNoble%5D+%5BHierarch%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&amp;name=Noble+Hierarch" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Noble Hierarch</a></span>
        </span>

                                                <span class="row">
          <span class="card-count">4</span>
          <span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BStoneforge%5D+%5BMystic%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&amp;name=Stoneforge+Mystic" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Stoneforge Mystic</a></span>
        </span>

                                                <span class="row">
          <span class="card-count">4</span>
          <span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BTrue-Name%5D+%5BNemesis%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&amp;name=True-Name+Nemesis" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">True-Name Nemesis</a></span>
        </span>


  </div>


  <div class="sorted-by-sorcery clearfix element">


    <h5>Sorcery (3)</h5>

                                      <span class="row">
          <span class="card-count">3</span>
          <span class="card-name"><a href="http://gatherer.wizards.com/Pages/Search/Default.aspx?name=+%5BPonder%5D" data-src="http://gatherer.wizards.com/Handlers/Image.ashx?type=card&amp;name=Ponder" data-mp4="http://magic.wizards.com/" data-webm="http://magic.wizards.com/" data-gif="http://magic.wizards.com/" class="deck-list-link">Ponder</a></span>
        </span>

And the Python code is:

card_number_list=[]
number_of_cards=int(0)
    #find out how many of x cards there are in the deck
def card_number_regex(card_number_list):
    global number_of_cards
    global html
    number_in_set= re.search("card-count.*",html)
    get_rid= re.search("card-count.*",html).group(0)
    html=html.replace(get_rid,"")
    number_in_set=number_in_set.group(0)
    html=html.replace(number_in_set, "")
    number_in_set=number_in_set.replace('card-count">',"")
    number_in_set=number_in_set.replace('</span>', "")
    card_number_list.append(number_in_set)
    number_in_set_int=int(number_in_set)
    print(number_in_set_int)
    number_of_cards=(number_of_cards+number_in_set_int)
    return number_of_cards

while number_of_cards<75:
    card_number_regex(card_number_list)

The output I get when I run this is 1 2 4 3

dovefromhell

1 Answer


While many seem to rather bash on your choice to use regex for this task, I would argue that it does not seem too difficult for your specific goal and will provide an actual answer for what you asked for.

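import re

# `html` is assumed to already hold the page source as a string
matches = re.findall(r'<span class="card-count">(.*?)</span>', html)
print(matches[0])  # first card count found in the document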

That regex should give you the contents of every card-count span as a list; taking the first index then retrieves only the match you want your regex to find.
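To show what the list looks like, here is a minimal sketch run against a cut-down sample. The sample_html string and the card_counts name are made up for illustration (the real page is much longer), and I have tightened the group to \d+ on the assumption that counts are always plain digits:

```python
import re

# Hypothetical sample standing in for the real page source.
sample_html = """
<span class="row">
  <span class="card-count">1</span>
  <span class="card-name">Garruk Relentless</span>
</span>
<span class="row">
  <span class="card-count">4</span>
  <span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
  <span class="card-count">4</span>
  <span class="card-name">Noble Hierarch</span>
</span>
"""

# findall returns the captured group of every match, in document order,
# so identical counts are kept rather than collapsed.
card_number_list = re.findall(r'<span class="card-count">(\d+)</span>', sample_html)
card_counts = [int(n) for n in card_number_list]

print(card_number_list)   # ['1', '4', '4']
print(sum(card_counts))   # 9
```

If you want every count rather than only the first, the whole list is already there, so no loop that edits the html string should be needed.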

Obviously this would work less well for other use cases, but since you seem to know that you only ever want the first occurrence in the HTML document, it does not matter that the list contains all of them, even when they sit inside another div tag, etc.
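As for why your loop prints 1 2 4 3: my best guess is that html.replace(get_rid, "") removes every occurrence of the matched text, not only the first, so all rows whose count is identical to the matched one vanish in a single call. A small sketch with a made-up two-row snippet (snippet, removed_all and removed_one are names invented for illustration):

```python
import re

# Hypothetical snippet in which both rows carry the same count.
snippet = 'card-count">4</span>\ncard-count">4</span>\n'

# "." does not match a newline, so this grabs the first line only.
first = re.search(r'card-count.*', snippet).group(0)

# str.replace without a count removes every identical occurrence.
removed_all = snippet.replace(first, "")

# Passing 1 as the third argument removes only the first occurrence.
removed_one = snippet.replace(first, "", 1)

print(removed_all.count("card-count"))  # 0
print(removed_one.count("card-count"))  # 1
```

So if you stay with the search-and-remove approach, limiting the replace to one occurrence should stop the duplicates from disappearing.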

And as others have said, I don't see why you wouldn't use a regular HTML parser for this.
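For comparison, a rough sketch of the parser route using html.parser from the standard library. CardCountParser and sample_html are hypothetical names, and the sketch assumes the class attribute is exactly "card-count" and that the span contains only digits:

```python
from html.parser import HTMLParser


class CardCountParser(HTMLParser):
    """Collects the number inside every <span class="card-count">."""

    def __init__(self):
        super().__init__()
        self.in_count = False
        self.counts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == "span" and ("class", "card-count") in attrs:
            self.in_count = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_count = False

    def handle_data(self, data):
        # Only keep text seen while inside a card-count span.
        if self.in_count and data.strip():
            self.counts.append(int(data.strip()))


# Hypothetical sample standing in for the real page source.
sample_html = """
<span class="row">
  <span class="card-count">2</span>
  <span class="card-name">Jace, the Mind Sculptor</span>
</span>
<span class="row">
  <span class="card-count">4</span>
  <span class="card-name">Deathrite Shaman</span>
</span>
<span class="row">
  <span class="card-count">4</span>
  <span class="card-name">Noble Hierarch</span>
</span>
"""

parser = CardCountParser()
parser.feed(sample_html)
print(parser.counts)  # [2, 4, 4]
```

It is more code than the regex, but it does not depend on the exact spacing or attribute order of the tag text.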

felix
  • Thank you very much. I only used regex because I had never come across HTML parsing before and a friend suggested I could use regex, but I'll certainly be looking into parsers for further use. – dovefromhell Jan 26 '17 at 10:34
  • Noticed I had made a typo in the code, sorry! The code block is edited now (I had somehow typed "search" instead of "findall"; re.search does not return a list, re.findall does). – felix Jan 26 '17 at 11:18