I have a return value from a search I'm doing which returns a lot of HTML.

for i in deal_list:
    regex2 = '(?s)' + '<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="' + i + '"' + '.+?</figure>'
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2, htmltext)
    html_captured = info2.group(0).split('</figure>')
    print html_captured

Here is an example of what is being returned:

<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{&quot;accessType&quot;:&quot;extended&quot;}" data-bh-viewport="respect">
      <a href="//www" class="deal-tile-inner">
        <img>
      <figcaption>
                  <div class="deal-tile-content">
          <p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
          <p class="merchant-name truncation ">1742 Wine Bar</p>
            <p class="deal-location truncate-others ">
              <span class="deal-location-name">Upper East Side</span> 
            </p>
  <div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
        </div>
        <div class="purchase-info clearfix ">
          <p class="deal-price">
              <s class="original-price">$178.90</s>
              <s class="discount-price">$49</s>

  </p>
          <div class="hide show-in-list-view">
            <p class="deal-tile-actions">
          <button class="btn-small btn-buy" data-bhw="ViewDealButton">
            View Deal
          </button>
</p>
  </div>
        </div>
      </figcaption>
      </a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{&quot;accessType&quot;:&quot;extended&quot;}" data-bh-viewport="respect">
            <a href="//www" class="deal-tile-inner">
              <img>
      <figcaption>
                        <div class="deal-tile-content">
          <p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
          <p class="merchant-name truncation ">Statler Grill</p>
            <p class="deal-location truncate-others ">
              <span class="deal-location-name">Midtown</span> 
            </p>
  <div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
        </div>
        <div class="purchase-info clearfix ">
          <p class="deal-price">
              <s class="original-price">$213</s>
              <s class="discount-price">$89</s>

  </p>
          <div class="hide show-in-list-view">
            <p class="deal-tile-actions">
          <button class="btn-small btn-buy" data-bhw="ViewDealButton">
            View Deal
          </button>
</p>
  </div>
        </div>
      </figcaption>
            </a>
</figure>

I want to use html_captured = info2.group(0).split('</figure>') so that the HTML between each pair of <figure> tags becomes one element of a list, in this case html_captured.

It kind of works, except that each match becomes its own list with a '' at the end. For example: ['<figure .... </figure>',''] ['<figure .... </figure>','']

But what I want is ['<figure .... </figure>','<figure .... </figure>','<figure .... </figure>'...etc]

alecxe
Jonathan Scialpi

1 Answer

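There are special tools for this job: HTML parsers. A parser understands the tag structure, so you can ask for every figure element directly instead of slicing the markup as text.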

Example using BeautifulSoup:

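from bs4 import BeautifulSoup

data = """
your html here
"""

# Name the parser explicitly; "html.parser" ships with Python.
soup = BeautifulSoup(data, "html.parser")

# find_all() returns every <figure> element in one flat, list-like result.
figures = soup.find_all('figure')
print(figures)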
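To get exactly the list the question asks for (one HTML string per deal, all in a single list), a minimal sketch could look like the following. The shortened `htmltext` sample and the extra `deal:not-wanted` entry are made up for illustration; `deal_list`, `htmltext` and `html_captured` are the names used in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down sample standing in for the real page source.
htmltext = """
<figure class="deal-card" data-bhc="deal:statler-grill-4">
  <p class="merchant-name">Statler Grill</p>
</figure>
<figure class="deal-card" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar">
  <p class="merchant-name">1742 Wine Bar</p>
</figure>
<figure class="deal-card" data-bhc="deal:not-wanted">
  <p class="merchant-name">Other</p>
</figure>
"""

deal_list = ["deal:giorgios-brick-oven-pizza-wine-bar", "deal:statler-grill-4"]

soup = BeautifulSoup(htmltext, "html.parser")

# Build ONE list outside the loop and append to it, so every deal
# ends up in the same list instead of a fresh list per iteration.
html_captured = []
for deal_id in deal_list:
    # Look the figure up by its data-bhc attribute instead of a regex.
    figure = soup.find("figure", attrs={"data-bhc": deal_id})
    if figure is not None:
        # str() turns the parsed element back into its HTML markup.
        html_captured.append(str(figure))
```

Each entry of `html_captured` then starts with `<figure` and ends with `</figure>`, in the order given by `deal_list`. If you need the fields rather than the raw markup, keep the parsed element and call `find()` on it instead of converting it to a string.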

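Also see why you should not use regex for parsing HTML. In short, markup can nest and can vary in attribute order and whitespace, which a pattern cannot track reliably.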
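As for the output described in the question, two things are going on. `str.split('</figure>')` removes the delimiter and, because the matched text ends with that delimiter, leaves an empty string as the last element. On top of that, `html_captured` is rebound to a new list on every pass through the loop. The sketch below reproduces both effects with a made-up two-deal string and shows the smallest change that yields one flat list while staying with `re`. It is only meant to explain the symptom; the short `data-bhc` values are placeholders, and the parser approach remains the sturdier option.

```python
import re

# Hypothetical, minimal stand-in for the page source.
htmltext = (
    '<figure data-bhc="deal:a"><p>A</p></figure>'
    '<figure data-bhc="deal:b"><p>B</p></figure>'
)
deal_list = ["deal:a", "deal:b"]

# Splitting on text that ends the string leaves a trailing '' and
# drops the closing tag from the first piece.
tail = '<figure data-bhc="deal:a"><p>A</p></figure>'.split('</figure>')

# Collect the whole match in one list created before the loop.
html_captured = []
for deal_id in deal_list:
    # re.escape() keeps characters in the id from being read as regex syntax.
    pattern = re.compile(
        r'(?s)<figure data-bhc="' + re.escape(deal_id) + r'".+?</figure>'
    )
    match = pattern.search(htmltext)
    if match:  # skip ids that are not on the page instead of raising
        html_captured.append(match.group(0))
```

Here `tail` is a two-element list whose second element is `''`, while `html_captured` holds one complete `<figure>...</figure>` string per deal.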

alecxe