DOM parsing or regex to extract relevant info from an HTML site

Question

Below is a small snippet of HTML from a larger website. I'm trying to parse some of the information from that site into a database, but I'm unsure what the best practice is. Should I use regex, or can I use the PHP DOM parser to get the relevant data?

E.g. I want to get the values "Prisantydning" => 2090000, "Fellesformue" => 4483 and "Verditakst" => 2300000.

What do you suggest?

<div class="mod">
    <div class="inner">
        <div class="bd objectinfo" data-automation-id="information">
            <h2>Prisdetaljer</h2>
            <dl class="multicol colspan2 fleft mtn">

                    <dt>Prisantydning</dt>
                    <dd>2 090 000,-</dd>



            </dl>
            <dl class="multicol colspan2 fleft mlm mtn">

                    <dt>Fellesformue</dt>
                    <dd>4 483,-</dd>


                    <dt>Verditakst</dt>
                    <dd>
                            2 300 000,-
                        <button class="icon utility strong contrast helpButton"
                                data-helptext-id="Verditakst">?
                        </button>
                        <div id="Verditakst" class="helptext supportText">
                            Verditakst utføres av en autorisert takstmann, og er en teknisk vurdering av hva boligen er
                            verdt.
                            Dette samkjøres med meglers markedsvurdering.
                        </div>
                    </dd>


                    <dt>Låneverdi</dt>
                    <dd>
                            2 000 000,-
                        <button class="icon utility strong contrast helpButton"
                                data-helptext-id="Låneverdi">?
                        </button>
                        <div id="Låneverdi" class="helptext supportText">
                            Låneverdi er en vurdering av markedsverdi som skal gi banken den nødvendige sikkerhet for
                            pant i
                            eiendommen. Låneverdi ligger som oftest på 80 - 90% av verditakst.
                        </div>
                    </dd>

possible duplicate of [php regex or html dom parsing](http://stackoverflow.com/questions/9948459/php-regex-or-html-dom-parsing) — Barmar, Dec 23 '12 at 10:52
You're not new to SO. Surely you've noticed that every time someone tries to use regex to process HTML, everyone tells them to use a real parser instead. — Barmar, Dec 23 '12 at 10:53
HTML is not a regular language, so regex is not the right tool for this; Barmar is right. – artragis, Dec 23 '12 at 10:55

artragis · Answer 1 · 2012-12-25T09:15:23.280

0

HTML is not a regular language, so regex is not the right tool for this; Barmar is right.

You can use DOM like this:

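libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect parser warnings instead of printing them
$doc = new \DOMDocument();
$doc->loadHTMLFile($yourUrl); // needs the ini option "allow_url_fopen" to be enabled
$datas = array();
foreach ($doc->getElementsByTagName('dt') as $dt) {
    // nextSibling is usually a whitespace text node, so walk forward to the next element (the dd)
    $dd = $dt->nextSibling;
    while ($dd !== null && $dd->nodeType !== XML_ELEMENT_NODE) {
        $dd = $dd->nextSibling;
    }
    if ($dd === null || $dd->nodeName !== 'dd') {
        continue; // this dt has no matching dd
    }
    // the dt text is the key; the value is the first number in the dd, e.g. "2 090 000,-" => 2090000
    if (preg_match('/\d[\d\s]*/u', $dd->textContent, $m)) {
        $datas[trim($dt->textContent)] = (int) preg_replace('/\D+/', '', $m[0]);
    }
}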

edited Dec 25 '12 at 09:15

answered Dec 23 '12 at 10:54


Hi, thanks, interesting example :) Looking around php.net and w3schools.com, I can't find the reason why I can't load a site as a DOMDocument. This is the page that I'm trying to open as a DOMDocument: http://www.finn.no/finn/realestate/homes/object?finnkode=33862098 Is it something about the site I'm trying to load that causes the problem? – Chris_1983_Norway Dec 24 '12 at 18:33
Perhaps you can use loadHtmlFile instead of load. I'll edit my answer. – artragis Dec 25 '12 at 09:15

score 0 · Answer 2 · answered Dec 23 '12 at 11:01

"Large website" usually means messy and unpredictable markup. But even if that weren't the case, regular expressions are simply not meant for evaluating HTML content (aside from a few very simple cases where that kind of approach might be justifiable). So yes, you should use a DOM parser like DOMDocument here.

DOMXPath would also be a nice addition in this case, because it lets you avoid the hassle of selecting DOMDocument nodes with DOM functions like getElementsByTagName.
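As a rough illustration of the DOMXPath approach, here is a minimal sketch. The inline markup is a shortened, hypothetical stand-in for the snippet in the question (the help text and class names are simplified), and the variable names are my own. It assumes each dt is followed by its dd and that the price is the first text node inside the dd.

```php
<?php
// Hypothetical sample markup, shortened from the snippet in the question.
$html = <<<'HTML'
<dl>
    <dt>Prisantydning</dt>
    <dd>2 090 000,-</dd>
    <dt>Verditakst</dt>
    <dd>
        2 300 000,-
        <button class="helpButton">?</button>
        <div class="helptext">Some help text mentioning 80 - 90%.</div>
    </dd>
</dl>
HTML;

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$prices = array();
foreach ($xpath->query('//dt') as $dt) {
    // following-sibling::dd[1] skips whitespace text nodes and selects the next dd
    $dd = $xpath->query('following-sibling::dd[1]', $dt)->item(0);
    if ($dd === null) {
        continue;
    }
    // text()[1] is the dd's first own text node, so help text in child elements is ignored
    $raw = $xpath->evaluate('string(text()[1])', $dd);
    // drop everything that is not a digit: "2 090 000,-" becomes 2090000
    $prices[trim($dt->textContent)] = (int) preg_replace('/\D+/', '', $raw);
}

print_r($prices);
```

Reading only the first text node of each dd keeps digits in the help text (such as "80 - 90%") out of the result. For a real page you would load the document with loadHTMLFile or loadHTML on the downloaded source, and probably narrow the query to the relevant container instead of matching every dt.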

It depends on the need. Here the selection is quite simple: the dt tags and their siblings. Moreover, XPath is greedier with memory: libxml works with a buffer and a pointer, whereas XPath loads the whole node collection into memory in order to traverse it. Not the best way, I think. – artragis, Dec 25 '12 at 09:17
