Writing multiple regex patterns to parse HTML

Question

I'm fetching an HTML page with file_get_contents(). It contains a table with more than 150 rows like the one below:

<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">FIRST+DATA</td>
    <td class="tabcol">
        <a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">
            asdxxx
        </a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">FOURTH+DATA</td>
</tr>

I want to get the FIRST DATA, SECOND DATA, THIRD DATA and FOURTH DATA with a preg_match_all() call. I tried to write multiple patterns, but I couldn't succeed. Here's what I tried:

preg_match_all('/(<td class="tabcol  tdmin_2l">|title=")(.*?)(<\/td>|")/s', $raw, $matches, PREG_SET_ORDER);

What are the correct patterns?

Use a DOM parser instead. Parsing HTML markup with a regular expression is very unreliable. It will break the moment some minor change is done to the markup. — arkascha, Nov 26 '16 at 10:28

score 3 · Answer 1 · edited Jun 20 '20 at 09:12

It does not answer your question directly, but it's the correct way to go.

You should avoid parsing HTML/XML content with regular expressions. Wonder why?

Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.

Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.

— https://stackoverflow.com/a/590789/65732
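To make the fragility concrete, here is a minimal sketch. The pattern and the reordered variant of the link are made up for illustration (they are not from the question); the point is only that a pattern tied to one exact spelling of the markup stops matching after a harmless change, while PHP's built-in DOMDocument reads both variants the same way.

```php
<?php

// Hypothetical pattern tied to the exact markup: double quotes, title before href.
$pattern = '/title="([^"]*)"\s+href="([^"]*)"/';

// The link roughly as it appears in the question (rel attribute omitted for brevity).
$original = '<a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA">asdxxx</a>';

// Assumed variant: same link, attributes swapped and single-quoted.
$reordered = "<a class='modal-button' href='THIRD+DATA' title='SECOND+DATA'>asdxxx</a>";

var_dump(preg_match($pattern, $original, $m1));  // int(1)
var_dump(preg_match($pattern, $reordered, $m2)); // int(0)

// A DOM parser does not care about attribute order or quote style.
libxml_use_internal_errors(true); // keep parser warnings out of the output
$titles = [];
foreach ([$original, $reordered] as $markup) {
    $doc = new DOMDocument();
    $doc->loadHTML($markup);
    $link = $doc->getElementsByTagName('a')->item(0);
    $titles[] = $link->getAttribute('title');
}

var_dump($titles); // both entries are "SECOND+DATA"
```

Both versions of the link are equivalent HTML, yet only the first one matches the regex.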

Use a DOM parser instead. Here's a glimpse of what it's like:

composer require symfony/dom-crawler symfony/css-selector

<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<HTML
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;

$crawler = new Crawler($html);

$first  = $crawler->filter('.tabcol.tdmin_2l')->text();
$second = $crawler->filter('.tabcol:nth-child(2) a')->attr('title');
$third  = $crawler->filter('.tabcol:nth-child(2) a')->attr('href');
$fourth = $crawler->filter('.tabcol:nth-child(4)')->text();

var_dump($first, $second, $third, $fourth);
// Outputs:
// string(10) "FIRST+DATA"
// string(11) "SECOND+DATA"
// string(10) "THIRD+DATA"
// string(11) "FOURTH+DATA"

Easier and cleaner, right?

Such parsers also let you extract elements with XPath.
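As a rough sketch of the XPath route, the following uses PHP's built-in DOMDocument and DOMXPath instead of the Symfony components, so it needs no Composer packages. The surrounding `<table>` element, the shortened `<a>` tag (no rel attribute) and the array keys are my assumptions for the example, not part of the original page.

```php
<?php

// Sample markup; the <table> wrapper is assumed, the row follows the question.
$html = <<<HTML
<table>
<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">FIRST+DATA</td>
    <td class="tabcol">
        <a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA">asdxxx</a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">FOURTH+DATA</td>
</tr>
</table>
HTML;

libxml_use_internal_errors(true); // real-world pages often trigger parser warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select every <tr> whose class list contains the token "tabrow".
$rowQuery = '//tr[contains(concat(" ", normalize-space(@class), " "), " tabrow ")]';

$rows = [];
foreach ($xpath->query($rowQuery) as $tr) {
    // Each expression is evaluated relative to the current <tr>.
    $rows[] = [
        'first'  => trim($xpath->evaluate('string(td[1])', $tr)),
        'second' => $xpath->evaluate('string(td[2]/a/@title)', $tr),
        'third'  => $xpath->evaluate('string(td[2]/a/@href)', $tr),
        'fourth' => trim($xpath->evaluate('string(td[4])', $tr)),
    ];
}

print_r($rows);
```

Because the loop runs once per matching `<tr>`, the same code should cover all 150+ rows of the real table without changes.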

I like this solution, but are you sure it finds 2nd and 3rd in the "title" and "href" attributes? — fafl, Nov 26 '16 at 17:34

fafl · Accepted Answer · 2016-11-26T10:44:31.990

2

Try this:

$str = <<<HTML
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;

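// Capture the contents of every <td>. The "s" flag lets "." match newlines,
// so cells that span several lines (as in the question's markup) are captured too.
preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $str, $td_matches);
// The second cell holds the <a> tag; pull its title and href attributes out of it.
preg_match('/ title="([^"]*)"/i', $td_matches[1][1], $title);
preg_match('/ href="([^"]*)"/i', $td_matches[1][1], $href);

// Print the four values: first cell, title, href, fourth cell.
echo $td_matches[1][0] . "\n";
echo $title[1] . "\n";
echo $href[1] . "\n";
echo $td_matches[1][3];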
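The snippet above handles a single row. Since the question mentions more than 150 rows, here is one possible way to extend the same idea: first split the input into `<tr>` blocks, then apply the cell patterns to each block. The two sample rows and their values (A1 to B4) are placeholders I made up, and the sketch assumes every row has the same four-cell layout as in the question.

```php
<?php

// Two placeholder rows in the question's layout (rel attribute omitted).
$str = <<<HTML
<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">A1</td>
    <td class="tabcol">
        <a class="modal-button" title="A2"  href="A3">asdxxx</a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">A4</td>
</tr>
<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">B1</td>
    <td class="tabcol">
        <a class="modal-button" title="B2"  href="B3">asdxxx</a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">B4</td>
</tr>
HTML;

$results = [];

// The "s" flag lets "." match newlines, so multi-line rows and cells are captured.
preg_match_all('/<tr[^>]*>(.*?)<\/tr>/is', $str, $tr_matches);

foreach ($tr_matches[1] as $row) {
    preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $row, $td);
    if (count($td[1]) < 4) {
        continue; // skip rows that do not have the expected four cells
    }

    // The second cell contains the <a> tag with the title and href attributes.
    preg_match('/\stitle="([^"]*)"/i', $td[1][1], $title);
    preg_match('/\shref="([^"]*)"/i', $td[1][1], $href);

    $results[] = [
        trim($td[1][0]),
        $title[1] ?? null, // null when the attribute is missing
        $href[1] ?? null,
        trim($td[1][3]),
    ];
}

print_r($results);
```

This still depends on the markup keeping its current shape (double-quoted attributes, no nested tables), so it is only as robust as the page is stable.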

edited Nov 26 '16 at 10:44

answered Nov 26 '16 at 10:28

fafl


Thanks, this is not bad, but the regex pattern seems to need some modifications, because the second and third data are combined in this pattern. – yucel Nov 26 '16 at 10:38
I didn't understand at first, is it better like this? – fafl Nov 26 '16 at 10:45
Thanks, I made some modifications and now it works well! – yucel Nov 26 '16 at 11:07
