
I have this HTML page, and I'm trying to extract the following information from this div:

<div class="clearfix">
<div class="container left">    
    <div class="logo">
      <a href="/teams/belarus/fc-bate-borisov/200/">
        <img src="http://cache.images.core.optasports.com/soccer/teams/150x150/200.png" alt="FC BATE Borisov" />
      </a>
    </div>
  </div>

  <div class="container middle">
    <div class="details clearfix">
      <dl>
        <dt>Gara</dt>
        <dd><a href="/national/belarus/premier-league/2016/regular-season/r34862/">Premier League</a></dd>

        <dt>Data</dt>
        <dd><a href="/matches/2016/06/25/"><span class='timestamp' data-value='1466877600' data-format='d mmmm yyyy'>25 giugno 2016</span></a></dd>

        <dt>Game week</dt>
        <dd>14</dd>

        <dt>calcio di inizio</dt>
        <dd>
          <span class='timestamp' data-value='1466877600' data-format='HH:MM'>20:00</span>
          (<span class="game-minute">FP'</span>)
        </dd>
      </dl>
    </div>

    <div class="details clearfix">
      <dl>
        <dt>Stadio</dt>
        <dd><a href="venue/">Borisov Arena (Barysaw (Borisov))</a></dd>

      </dl>
    </div>

  </div>

  <div class="container right">
    <div class="logo">
      <a href="/teams/belarus/fc-vitebsk/204/">
        <img src="http://cache.images.core.optasports.com/soccer/teams/150x150/204.png" alt="FC Vitebsk" />
      </a>
    </div>
  </div>
</div>
    </div>
  </div>
</div>

In particular, I need the calcio di inizio (kick-off), Game week, and Stadio (stadium) fields.

So far I've tried this regex: <div[^<>]*class="clearfix"[^<>]*>(?<content>.*?)

but when I test it on https://regex101.com/ it doesn't run. I think the clearfix class is used on multiple divs, so that could be the problem.

Also, the elements I need don't have any class I could select them by. Any ideas?

  • Have you considered using a proper HTML parser instead? – Pekka Jun 25 '16 at 19:37
  • Please see the [standard answer](http://stackoverflow.com/a/1732454) for why not to do it with regexes. Now, to answer your question, you might use something like [Xidel](http://www.videlibri.de/xidel.html). Perhaps like this: `xidel -e '//div[@class="clearfix"]' file.html`. – Sato Katsura Jun 25 '16 at 19:47
  • Which do you suggest? I'm on .NET. – John Carter Jun 25 '16 at 19:52

1 Answer


If you add an id to the div you want to get the contents of (for example "myDiv"), you could run the following JavaScript expression to return its HTML contents:

document.getElementById("myDiv").innerHTML

I am not exactly sure if this is what you want, since it's not a regex, but if so, I hope this helps!
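
If adding an id isn't an option, a rough sketch in the same spirit is to walk the dt/dd pairs instead, since the dd elements have no class of their own. This assumes the code runs in a browser with the page loaded; extractDetails is a hypothetical helper name, and the div.container.middle selector in the usage comment is taken from the markup in the question.

```javascript
// Sketch: collect <dt>/<dd> pairs under a root element into a plain object.
// `extractDetails` is a hypothetical helper name, not part of any library.
function extractDetails(root) {
  const details = {};
  for (const dt of root.querySelectorAll("dt")) {
    const dd = dt.nextElementSibling;
    // Skip a <dt> that is not directly followed by a <dd>.
    if (!dd || dd.tagName.toUpperCase() !== "DD") continue;
    const key = dt.textContent.trim();
    // Collapse runs of whitespace so multi-line values read cleanly.
    details[key] = dd.textContent.replace(/\s+/g, " ").trim();
  }
  return details;
}

// In a browser, with the page from the question loaded:
// const info = extractDetails(document.querySelector("div.container.middle"));
// info["Game week"], info["Stadio"], info["calcio di inizio"]
```

The keys are whatever text the dt elements contain, so the lookup depends on the page language (here the Italian labels from the question).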

user31415