Extracting data from HTML using preg_match_all

Question

I have a series of HTML pages from which I want to extract certain product information. The HTML is built up like this:

<h1 style="margin-top: 20px;">Productinformatie</h1>


<div class="group">
<div class="columns2">
            <table width="100%" cellpadding="4" cellspacing="0" border="0" class="product_info_table stripe">
    <tr style="background-color: #3c75a6; color: #fff; font-weight: bold;">
        <td colspan="2" style="background-color: #3c75a6; border-bottom: 2px solid #9dbeda;">Design</td>
    </tr>
                    <tr class="normal">
            <td width="250" valign="top"><b>Kleur van het product</b></td>
            <td><div style="max-height: 40px; overflow: hidden;">Zwart, Zilver</div></td>
        </tr>
.............
                    <tr class="normal">
            <td width="250" valign="top"><b>Hoogte (achterzijde)</b></td>
            <td><div style="max-height: 40px; overflow: hidden;">3 cm</div></td>
        </tr>
                </table>

</div>  
</div>

<div class="group" style="overflow-x: auto; overflow-y: hidden; height: 140px; white-space: nowrap;" id="image_scroll">

I use this line, but it does not return any results. I need to find out how line breaks (BR) can be handled in a preg_match_all pattern.

        //Omschrijving  <h1 style="margin-top: 20px;">Productinformatie</h1>    <div class="group"> <div class="columns2">  </table>    </div>      </div>
//  preg_match_all('/\<h1 style\=\"margin-top\: 20px\;\"\>Productinformatie\<\/h1\>(.*?)\<ul style\=\"list\-style\-type\: none\;\"\>/s', $html, $matchomschrijving);  
    preg_match_all('/\<h1 style\=\"margin-top\: 20px\;\"\>Productinformatie\<\/h1\>(.*)?\<\/table\>.*?\<\/div\>?\<\/div\>/s', $html, $matchomschrijving);  
//  $tempomschrijvinghtml = str_replace('"',"'",$matchomschrijving[1][0]); 
    $tempomschrijvinghtml = MinifyHTML($matchomschrijving[1][0]);
//  $tempomschrijving = '<table>';
    $tempomschrijving .= $tempomschrijvinghtml;
    $tempomschrijving .= '</table></div></div>';
    echo 'Omschrijving: ' . $tempomschrijving . '<br>';

Thanks.

Possible duplicate of [How do you parse and process HTML/XML in PHP?](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php) — Casimir et Hippolyte, Apr 28 '17 at 19:46
You can try to remove the new line from the $html variable. For example $html = str_replace("\n", "", $html); — Nadir Latif, Apr 30 '17 at 03:25
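
On the line-break part of the question: in PCRE the `s` modifier already lets `.` match newlines, and `\s*` matches any run of whitespace (spaces and line breaks included) between two tags, so stripping newlines first should not be required. A minimal sketch, assuming a shortened, made-up `$html` sample in place of the real page; the pattern is only illustrative and stays brittle if the markup changes:

```php
<?php
// Hypothetical sample: a shortened stand-in for the real page.
$html = <<<HTML
<h1 style="margin-top: 20px;">Productinformatie</h1>

<div class="group">
<div class="columns2">
<table class="product_info_table stripe">
<tr class="normal"><td><b>Kleur van het product</b></td><td><div>Zwart, Zilver</div></td></tr>
</table>

</div>  
</div>
HTML;

// "#" as delimiter avoids escaping "/"; the "s" modifier lets "." match newlines.
// "\s*" absorbs the spaces and line breaks between the closing tags.
$pattern = '#<h1[^>]*>Productinformatie</h1>(.*?</table>)\s*</div>\s*</div>#s';

if (preg_match($pattern, $html, $m)) {
    // $m[1] holds everything from after the heading up to and including </table>.
    echo trim($m[1]), "\n";
} else {
    echo "no match\n";
}
```

In the question's pattern the two closing `</div>` tags are required to sit directly next to each other, while the sample HTML has spaces and a line break between them; that alone may be enough to prevent a match.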

score 0 · Answer 1 · answered Apr 28 '17 at 20:28

To search, extract, and edit HTML, take advantage of the built-in DOMxxx classes (such as DOMDocument and DOMXPath) and of the HTML structure. With the XPath language you can efficiently target the part of the DOM tree you want. Example:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

$nodeList = $xp->query('//h1[.="Productinformatie"]/following-sibling::div[@class="group"]/div[@class="columns2"]/table[1]');

echo $dom->saveHTML($nodeList->item(0));
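
If the goal is the individual label/value pairs rather than the table's HTML, the same XPath approach can walk the rows. A minimal sketch, assuming each data row is a `tr` with class `normal` holding exactly two `td` cells, as in the question's sample; the `$html` string is a shortened, made-up stand-in and `$specs` is a hypothetical name:

```php
<?php
// Hypothetical sample: a shortened stand-in for the real page.
$html = <<<HTML
<h1 style="margin-top: 20px;">Productinformatie</h1>
<div class="group">
<div class="columns2">
<table class="product_info_table stripe">
<tr><td colspan="2">Design</td></tr>
<tr class="normal">
<td width="250" valign="top"><b>Kleur van het product</b></td>
<td><div style="max-height: 40px; overflow: hidden;">Zwart, Zilver</div></td>
</tr>
<tr class="normal">
<td width="250" valign="top"><b>Hoogte (achterzijde)</b></td>
<td><div style="max-height: 40px; overflow: hidden;">3 cm</div></td>
</tr>
</table>
</div>
</div>
HTML;

libxml_use_internal_errors(true); // keep warnings about sloppy markup quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
$xp = new DOMXPath($dom);

// Select only the data rows; the header row has no class="normal".
$rows = $xp->query('//h1[.="Productinformatie"]/following-sibling::div[@class="group"]//table[1]//tr[@class="normal"]');

$specs = []; // label => value
foreach ($rows as $row) {
    $cells = $xp->query('td', $row); // td children of the current row
    if ($cells->length === 2) {
        $label = trim($cells->item(0)->textContent);
        $specs[$label] = trim($cells->item(1)->textContent);
    }
}

print_r($specs);
```

`textContent` drops the inner `b` and `div` tags, so no separate tag stripping or minifying step should be needed for the values.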

Casimir et Hippolyte


Thanks, I'll give it a try. I am not familiar with the XPath language. Could you help me get started if I wanted to get the part between:
Productinformatie
and – Riekelt Keuter Apr 30 '17 at 19:02
