I'm fetching an HTML page with file_get_contents() and get a table like the one below, with more than 150 rows:

<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">FIRST+DATA</td>
    <td class="tabcol">
        <a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">
            asdxxx
        </a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">FOURTH+DATA</td>
</tr>

I want to extract FIRST+DATA, SECOND+DATA, THIRD+DATA and FOURTH+DATA with a preg_match_all() call. I tried several patterns without success. Here's what I tried:

preg_match_all('/(<td class="tabcol  tdmin_2l">|title=")(.*?)(<\/td>|")/s', $raw, $matches, PREG_SET_ORDER);

What are the correct patterns?

sepehr
yucel

2 Answers

It does not answer your question directly, but it's the correct way to go.

You should avoid parsing HTML/XML content with regular expressions. Wonder why?

Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag which is not possible with regexps.

Regular expressions can only match regular languages but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics but that will not work on every condition. It should be possible to present a HTML file that will be matched wrongly by any regular expression.

https://stackoverflow.com/a/590789/65732
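
To make the quoted limitation concrete, here is a minimal sketch of a tag-matching pattern going wrong. The two input strings are made up for illustration and do not come from the page in the question. Neither the lazy nor the greedy variant can pair each opening tag with its own closing tag:

```php
<?php
// Hypothetical inputs, chosen only to illustrate the limitation.
$nested   = '<div>outer <div>inner</div> tail</div>';
$siblings = '<div>a</div><div>b</div>';

// Lazy quantifier: stops at the first closing tag, cutting the outer element short.
preg_match('/<div>(.*?)<\/div>/s', $nested, $lazy);

// Greedy quantifier: runs to the last closing tag, swallowing both siblings.
preg_match('/<div>(.*)<\/div>/s', $siblings, $greedy);

echo $lazy[1], "\n";   // outer <div>inner
echo $greedy[1], "\n"; // a</div><div>b
```

A pattern can be tuned for one of these inputs, but not for both at once, which is why a parser that tracks nesting is the safer tool.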

Use a DOM parser instead. Here's a glimpse of what it's like:

composer require symfony/dom-crawler symfony/css-selector

<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<HTML
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;

$crawler = new Crawler($html);

$first  = $crawler->filter('.tabcol.tdmin_2l')->text();
$second = $crawler->filter('.tabcol:nth-child(2) a')->attr('title');
$third  = $crawler->filter('.tabcol:nth-child(2) a')->attr('href');
$fourth = $crawler->filter('.tabcol:nth-child(4)')->text();

var_dump($first, $second, $third, $fourth);
// Outputs:
// string(10) "FIRST+DATA"
// string(11) "SECOND+DATA"
// string(10) "THIRD+DATA"
// string(11) "FOURTH+DATA"

Easier and cleaner, right?

Using such parsers, you have the ability to extract elements using XPath as well.
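
As a rough sketch of the XPath route, the example below uses PHP's bundled DOMDocument and DOMXPath classes instead of the Symfony crawler, so it runs without Composer. It also loops over every row, since the real page has many. The surrounding `<table>`, the shortened `<a>` tag, and the second row are assumptions added for illustration:

```php
<?php
// Sample markup: the question's row wrapped in a <table>, plus an invented second row.
$html = <<<HTML
<table>
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIFTH+DATA</td>
<td class="tabcol"><a class="modal-button" title="SIXTH+DATA"  href="SEVENTH+DATA">qwe</a></td>
<td class="tabcol"></td>
<td class="tabcol">EIGHTH+DATA</td>
</tr>
</table>
HTML;

// Real-world pages are rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$rows  = [];

// Select every <tr> whose class list contains the token "tabrow".
$query = '//tr[contains(concat(" ", normalize-space(@class), " "), " tabrow ")]';
foreach ($xpath->query($query) as $tr) {
    // Each expression is evaluated relative to the current row.
    $rows[] = [
        'first'  => trim($xpath->evaluate('string(td[1])', $tr)),
        'second' => $xpath->evaluate('string(td[2]/a/@title)', $tr),
        'third'  => $xpath->evaluate('string(td[2]/a/@href)', $tr),
        'fourth' => trim($xpath->evaluate('string(td[4])', $tr)),
    ];
}

print_r($rows);
```

With the real page, `$html` would be the string returned by file_get_contents(); the positional `td[1]`, `td[2]`, `td[4]` lookups assume every row keeps the four-cell layout shown in the question.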

sepehr

Try this:

$str = <<<HTML
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;

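// The "s" modifier lets "." match newlines, so cells that span several lines are captured too.
preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $str, $td_matches);
// The second cell holds the <a> tag; pull its title and href attributes out of it.
preg_match('/ title="([^"]*)"/i', $td_matches[1][1], $title);
preg_match('/ href="([^"]*)"/i', $td_matches[1][1], $href);

echo $td_matches[1][0] . "\n"; // FIRST+DATA
echo $title[1] . "\n";         // SECOND+DATA
echo $href[1] . "\n";          // THIRD+DATA
echo $td_matches[1][3];        // FOURTH+DATA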
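
The snippet above reads a single row, while the page in the question has more than 150. One possible extension, sketched below, first splits the markup into rows and then applies the same per-cell patterns to each one. It assumes every data row is a `<tr>` whose class starts with "tabrow" and has four cells; `$raw` stands in for the fetched page, and its shortened `<a>` tag and second row are invented for illustration:

```php
<?php
// $raw stands in for the fetched page; the second row is made up.
$raw = <<<HTML
<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">FIRST+DATA</td>
    <td class="tabcol">
        <a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA">
            asdxxx
        </a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">FOURTH+DATA</td>
</tr>
<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">A1</td>
    <td class="tabcol"><a class="modal-button" title="A2" href="A3">x</a></td>
    <td class="tabcol"></td>
    <td class="tabcol">A4</td>
</tr>
HTML;

$rows = [];

// Step 1: capture the inner markup of every row. The "s" modifier lets "." cross newlines.
preg_match_all('/<tr[^>]*class="tabrow[^"]*"[^>]*>(.*?)<\/tr>/is', $raw, $tr_matches);

foreach ($tr_matches[1] as $row) {
    // Step 2: capture the cells of this row only.
    preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $row, $td);
    if (count($td[1]) < 4) {
        continue; // skip rows that lack the expected four cells
    }

    // Step 3: pull the attributes out of the second cell.
    preg_match('/ title="([^"]*)"/i', $td[1][1], $title);
    preg_match('/ href="([^"]*)"/i', $td[1][1], $href);

    $rows[] = [trim($td[1][0]), $title[1] ?? '', $href[1] ?? '', trim($td[1][3])];
}

print_r($rows);
```

This stays a heuristic: it would break on nested tables or on attribute values quoted differently, so it only suits pages whose layout is known and stable.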
fafl