Regex PHP: Get specific content from a block of code from another website

Question

I have a site from which I want to get specific content from 7 posts. All seven posts have the same HTML layout (see below):

<div class="eventInfo">
<h3>Z's（矢沢永吉）</h3>
  <h4>Z's TOUR 2015</h4>

<dl>
    <dt><img src="/event/img/btn_day.png" alt="公演日時" width="92" height="20"> </dt>
    <dd>
      <table width="99%" border="0" cellpadding="0" cellspacing="0">
        <tbody><tr>
      <td width="9%" nowrap="nowrap">2015年6月</td>
      <td width="74%">4日 (木) 19:00開演</td>
    </tr>

  </tbody></table>
</dd>
<dt><img src="/event/img/btn_price.png" alt="料金" width="92" height="20"> </dt>
<dd>S¥10,500　A¥7,500 (全席指定・消費税込）<br><span class="attention">※</span>注意事項の詳細を<a href="http://www.siteurl.com/info/live/guidelines.html" target="_blank">矢沢永吉公式サイト</a>より必ずご確認ください</dd>

<dt><img src="/event/img/btn_ticket.png" alt="一般発売" width="92" height="20"> </dt>
<dd>
 <table width="99%" border="0" cellpadding="0" cellspacing="0">
  <tbody><tr>
    <td width="9%" nowrap="nowrap">2015年5月</td>
    <td width="74%">16日(土)</td>
  </tr>
</tbody></table>
  </dd>

  <dt><img src="/event/img/btn_contact.png" alt="お問合わせ" width="92" height="20"> </dt>
  <dd><a href="http://www.siteurl.com/" target="_blank">ソーゴー大阪</a>　06-6344-3326</dd>

  <dt><img src="/event/img/btn_info.png" alt="公演詳細" width="92" height="20"> </dt>
  <dd><a href="http://www.siteurl.com/zs/index_pc.html" target="_blank">http://www.siteurl.com</a> </dd>
</dl>
</div>

I just want to fetch the H3 and the first table from this layout. What regex should I use to get the desired results?

There are 7 posts just like the code above, and I have to get the H3 and the first table from each of them.

I have tested it, but I am not sure whether this is the correct way: https://regex101.com/r/sO6tJ8/1

But as you can see, I also have to match unwanted data such as the H4, DT and IMG :(

Did you read this? http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/ — Alessandro Da Rugna, Jun 26 '15 at 11:21
I know, it just points out that it is a **bad** idea to parse HTML with regexes. They are not the right tool for this job. If you are sure that the HTML will not change over time, and only need those strings, happy coding :-) — Alessandro Da Rugna, Jun 26 '15 at 11:30
Yes, I'm sure that the HTML will not change. So what is your suggestion now? :) — Omer, Jun 26 '15 at 11:32
Use an XML/HTML parser. http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php — Alessandro Da Rugna, Jun 26 '15 at 11:33
if it's actually always the same, you could split it on newlines and just fetch the [1] key from the split result... Parsers would get heavy. IF it's always the same of course. — Mathijs Segers, Jun 26 '15 at 11:52
It should be much more suitable in this case to use DOMDocument to parse the HTML (http://php.net/manual/en/class.domdocument.php). I use this in combination with cURL in my project to fetch the title from linked URLs. Take a look at line 630 here: https://github.com/mondjunge/Crush-Framework/blob/master/application/profilpage/Profilpage.php — mondjunge, Jun 26 '15 at 12:03
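
To illustrate the parser route suggested in the comments above, here is a minimal sketch using PHP's built-in DOMDocument and DOMXPath. The function name extractEvents and the shortened sample markup are assumptions made for illustration; only the eventInfo class and the tag layout come from the question.

```php
<?php
// Hypothetical sample: a shortened copy of one post from the question.
$html = <<<'HTML'
<div class="eventInfo">
<h3>Z's（矢沢永吉）</h3>
<h4>Z's TOUR 2015</h4>
<dl>
<dd><table><tbody><tr><td>2015年6月</td><td>4日 (木) 19:00開演</td></tr></tbody></table></dd>
<dd><table><tbody><tr><td>2015年5月</td><td>16日(土)</td></tr></tbody></table></dd>
</dl>
</div>
HTML;

// Hypothetical helper: returns the heading and the cells of the first
// table for every <div class="eventInfo"> found in the markup.
function extractEvents(string $html): array
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // The XML encoding hint makes libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $events = [];
    foreach ($xpath->query('//div[@class="eventInfo"]') as $div) {
        // item(0) is the first match in document order, or null if none.
        $h3 = $xpath->query('.//h3', $div)->item(0);
        $table = $xpath->query('.//table', $div)->item(0);

        $cells = [];
        if ($table !== null) {
            foreach ($xpath->query('.//td', $table) as $td) {
                $cells[] = trim($td->textContent);
            }
        }
        $events[] = [
            'heading' => $h3 !== null ? trim($h3->textContent) : null,
            'cells'   => $cells,
        ];
    }
    return $events;
}

foreach (extractEvents($html) as $event) {
    echo $event['heading'], ': ', implode(' ', $event['cells']), "\n";
}
```

Because the query is scoped to each eventInfo block, the same call should cover all seven posts on a page, and the H4, DT and IMG elements never need to be mentioned.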

dbrumann · Answer 1 · 2015-06-28T11:24:41.110

I don't think regex is your best choice here. If you can get away without using regex, I would do so. Have a look at Goutte, a PHP web scraper.

// Assumes Goutte is installed via Composer and autoloaded.
$client = new \Goutte\Client();
$crawler = $client->request('GET', 'http://www.example.com/some-page');
$heading = $crawler->filter('h3')->first();
$table = $crawler->filter('table')->first();

Not only is this more readable, it also makes it easier to fix things when the HTML structure changes.

If you must choose regex, you could do something like the following for the h3 (I haven't tested it, but something like that):

// Capture the inner text of every <h3>; $matches[1] holds the headings.
preg_match_all('/<h3>(.*?)<\/h3>/u', $html, $matches);
$headings = $matches[1];

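For the table it is similar, only you have to use the s (dotall) modifier so that . also matches newlines, since the table spans several lines. (The m modifier only changes how ^ and $ behave, so it does not help here.) It wouldn't hurt to add s to the h3 pattern as well, but from your example you don't need it.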
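
A minimal sketch of what the table pattern might look like, assuming the markup arrives as a UTF-8 string. The helper name firstTable and the shortened sample input are made up for illustration.

```php
<?php
// Hypothetical helper: returns the HTML of the first <table>, or null.
function firstTable(string $html): ?string
{
    // s (dotall): "." also matches newlines. u: pattern and subject are UTF-8.
    // The lazy ".*?" stops at the first closing tag instead of the last one.
    if (preg_match('/<table\b[^>]*>.*?<\/table>/su', $html, $m) === 1) {
        return $m[0];
    }
    return null;
}

// Hypothetical sample, shortened from the question's markup.
$html = <<<'HTML'
<dd>
<table width="99%">
<tr><td>2015年6月</td><td>4日 (木) 19:00開演</td></tr>
</table>
</dd>
<dd>
<table width="99%">
<tr><td>2015年5月</td><td>16日(土)</td></tr>
</table>
</dd>
HTML;

echo firstTable($html), "\n";
```

Note that this only works because the tables in the question are not nested; a table inside a table would end the match at the inner closing tag.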

score 0 · Answer 2 · answered Jun 30 '15 at 04:59

// I'm assuming you can get the HTML into a variable somehow
// I did my testing w/ a local file with your HTML content
$data = file_get_contents('foo.html');

$h3_content = array();
$table_content = array();

// h3 content is easy to grab; [^<]+ also matches newlines, so a heading
// that wraps across lines is still captured, but one with nested tags is not:
preg_match('/<h3>([^<]+)<\/h3>/', $data, $h3_content);

// without the s modifier, "." does not match newlines, so a table that
// spans several lines is not matched; put the entire HTML on one line.
// you don't need a new variable here, I did it only to be explicit
// that we have munged the original HTML into something else
$data2 = str_replace(array("\r", "\n"), '', $data);

// to separate tables, put a newline after each one
$data2 = str_replace('</table>', "</table>\n", $data2);
// now each line holds at most one table, so the greedy .+ cannot run past it
preg_match_all('/(<table.+<\/table>)/', $data2, $table_content);

echo $h3_content[1], "\n";
// [0][0] is the first table; [0][1] would be the second one
echo $table_content[0][0], "\n";
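
Since the question is about seven posts on one page, the same idea can be stretched to pair each heading with the first table after it. This is only a sketch under the assumption that every post contains an h3 followed by at least one table; the sample markup and the variable name $posts are made up for illustration.

```php
<?php
// Hypothetical sample: two shortened posts in the layout from the question.
$data = <<<'HTML'
<div class="eventInfo">
<h3>First event</h3>
<dl>
<dd><table>
<tr><td>2015年6月</td></tr>
</table></dd>
<dd><table>
<tr><td>2015年5月</td></tr>
</table></dd>
</dl>
</div>
<div class="eventInfo">
<h3>Second event</h3>
<dl>
<dd><table>
<tr><td>2015年7月</td></tr>
</table></dd>
</dl>
</div>
HTML;

// s: "." matches newlines; u: UTF-8. Each match is one <h3> plus the first
// <table> that follows it; PREG_SET_ORDER groups the captures per post.
preg_match_all(
    '/<h3>(.*?)<\/h3>.*?(<table\b.*?<\/table>)/su',
    $data,
    $posts,
    PREG_SET_ORDER
);

foreach ($posts as $post) {
    echo $post[1], "\n";   // heading text
    echo $post[2], "\n\n"; // HTML of the first table in that post
}
```

If a post had no table at all, its heading would be paired with the next post's table, which is one more reason a DOM parser is the safer choice when the layout is not guaranteed.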
