-2

I am trying to capture the HTML between each pair of <li> and </li> tags and put the captured text in an array.

I am trying to parse it with this Ruby expression:

page.scan(/<li><div class="info">(.*)<\/li>/)

However, for some reason, it returns no matches. What am I doing wrong?

Here is what the HTML looks like:

   <ul class="local">

        <li><div class="info">

    <span class="num">1</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/105111879-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Boston City Of Boston Housing Authority Main O</a></h2>

      <p><b>Address:</b>
        52 Chauncy St, Boston, MA 02111
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe0" class="rateMe" title="Rate this company">

      <a id="0_1" title="1" ></a>

      <a id="0_2" title="2" ></a>

      <a id="0_3" title="3" ></a>

      <a id="0_4" title="4" ></a>

      <a id="0_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/105111879-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">2</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/105109841-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Boston Checkcashers INC East Boston</a></h2>

      <p><b>Address:</b>
        19 Maverick Sq, Boston, MA 02128
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe1" class="rateMe" title="Rate this company">

      <a id="1_1" title="1" ></a>

      <a id="1_2" title="2" ></a>

      <a id="1_3" title="3" ></a>

      <a id="1_4" title="4" ></a>

      <a id="1_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/105109841-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">3</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/181884283-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Boston City Of Boston Housing Authority Develo</a></h2>

      <p><b>Address:</b>
        755 Tremont St, Boston, MA 02118
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe2" class="rateMe" title="Rate this company">

      <a id="2_1" title="1" ></a>

      <a id="2_2" title="2" ></a>

      <a id="2_3" title="3" ></a>

      <a id="2_4" title="4" ></a>

      <a id="2_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/181884283-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">4</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/142710920-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Citizens Bank Phonebank Boston Offices Boston</a></h2>

      <p><b>Address:</b>
        771 Commonwealth Ave, Boston, MA 02215
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe3" class="rateMe" title="Rate this company">

      <a id="3_1" title="1" ></a>

      <a id="3_2" title="2" ></a>

      <a id="3_3" title="3" ></a>

      <a id="3_4" title="4" ></a>

      <a id="3_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/142710920-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">5</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/199373037-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Citizens Bank Phonebank Boston Offices Boston</a></h2>

      <p><b>Address:</b>
        771 Commonwealth Ave, Boston, MA 02215
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe4" class="rateMe" title="Rate this company">

      <a id="4_1" title="1" ></a>

      <a id="4_2" title="2" ></a>

      <a id="4_3" title="3" ></a>

      <a id="4_4" title="4" ></a>

      <a id="4_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/199373037-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">6</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/181906441-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Boston City Of Boston Housing Authority Develo</a></h2>

      <p><b>Address:</b>
        266 N Beacon St, Brighton, MA 02135
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe5" class="rateMe" title="Rate this company">

      <a id="5_1" title="1" ></a>

      <a id="5_2" title="2" ></a>

      <a id="5_3" title="3" ></a>

      <a id="5_4" title="4" ></a>

      <a id="5_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/181906441-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">7</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/181906436-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Boston City Of Boston Housing Authority Develo</a></h2>

      <p><b>Address:</b>
        91 Ames St, Dorchester Center, MA 02124
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe6" class="rateMe" title="Rate this company">

      <a id="6_1" title="1" ></a>

      <a id="6_2" title="2" ></a>

      <a id="6_3" title="3" ></a>

      <a id="6_4" title="4" ></a>

      <a id="6_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/181906436-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">8</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/142706974-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Boston City Of Boston Housing Authority Develo</a></h2>

      <p><b>Address:</b>
        15 Mary Moore Beatty Cir, Mattapan, MA 02126
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe7" class="rateMe" title="Rate this company">

      <a id="7_1" title="1" ></a>

      <a id="7_2" title="2" ></a>

      <a id="7_3" title="3" ></a>

      <a id="7_4" title="4" ></a>

      <a id="7_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/142706974-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">9</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/105111596-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Boston Handyman For Boston Eastern Massachusetts</a></h2>

      <p><b>Address:</b>
        12 Muldoons Ct, Waltham, MA 02453
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe8" class="rateMe" title="Rate this company">

      <a id="8_1" title="1" ></a>

      <a id="8_2" title="2" ></a>

      <a id="8_3" title="3" ></a>

      <a id="8_4" title="4" ></a>

      <a id="8_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/105111596-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

        <li><div class="info">

    <span class="num">10</span>

  <div style="margin:0 0 0 45px;">

    <h2><a href="/local_detail_l/199782811-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6">Boston Clothing Architect</a></h2>

      <p><b>Address:</b>
        10 Tremont St, Boston, MA 02108
      </p>


      </div>
</div>

<div class="ratingbox">
  <span id="rateMe9" class="rateMe" title="Rate this company">

      <a id="9_1" title="1" ></a>

      <a id="9_2" title="2" ></a>

      <a id="9_3" title="3" ></a>

      <a id="9_4" title="4" ></a>

      <a id="9_5" title="5" ></a>

  </span>

    <div class="not-rated"><a href="/local_detail_l/199782811-ST25/Boston,%20MA/Boston,%20MA?_session_id=73215ec8bd6d1cf4da158da341e450d6#new-review" class="not-rated">Be the first to review!</a></div>

</div>

</li>

    </ul>
Jørgen R
user858642
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Uku Loskit Jul 23 '11 at 19:48

2 Answers

4

You really can't reliably parse HTML with regex. Try Nokogiri.

Don Roby
    "can't" might be [a little strong](http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491) but the usual naive approaches are full of holes. – mu is too short Jul 23 '11 at 22:01
1

First, read this: RegEx match open tags except XHTML self-contained tags

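Regex: enable the option that lets . match newlines ("single line" mode) and use a non-greedy quantifier so each match ends at the nearest </li>. In PCRE or .NET:

(?s)<li><div class="info">(.*?)<\/li>

Ruby does not accept the inline (?s) flag; its equivalent is the m modifier:

page.scan(/<li><div class="info">(.*?)<\/li>/m)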
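
To see why the pattern from the question finds nothing, here is a minimal sketch in plain Ruby. `HTML_SAMPLE` is a shortened stand-in for the page (an assumption, not the full markup) and `extract_items` is a hypothetical helper name. Without the `m` modifier, `.` stops at the first newline, so `.*` can never reach a `</li>` that sits on a later line.

```ruby
# Shortened stand-in for the page in the question (assumed, not the full markup).
HTML_SAMPLE = <<~HTML
  <ul class="local">
    <li><div class="info">
      <span class="num">1</span>
    </div>
    </li>
    <li><div class="info">
      <span class="num">2</span>
    </div>
    </li>
  </ul>
HTML

# Pattern from the question: "." does not match "\n", so no match can span lines.
ORIGINAL = /<li><div class="info">(.*)<\/li>/

# With the m modifier "." also matches newlines; ".*?" stops at the nearest </li>.
FIXED = /<li><div class="info">(.*?)<\/li>/m

# Hypothetical helper: returns one captured string per matched <li>.
def extract_items(page, pattern)
  page.scan(pattern).flatten
end

puts extract_items(HTML_SAMPLE, ORIGINAL).length
puts extract_items(HTML_SAMPLE, FIXED).length
```

Dropping the `?` after `.*` while keeping `m` would make the match greedy, so everything from the first `<li>` to the last `</li>` would come back as a single capture.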

Community
Kirill Polishchuk