
I am using PHP's DOM extension to parse a web page. This is the markup I am parsing:

<option value="A26JUYT14N57PY">Aleksander&#39;s Kindle Cloud Reader</option>
<option value="A13400OMTGFDRH">Aleksander&#39;s Kindle for PC</option>
<optgroup label="----OR----" style="color:#999;font-style:normal;font-weight:normal"> </optgroup>
<option value="add-new">Register a new Kindle</option>

My script is:

$options = $dom->getElementsByTagName('option');
foreach($options as $option)
{
    $attr = $option->getAttribute('value');
    $value = $option->nodeValue;
}

On my PC with PHP 5.3.9 it works as expected:

$attr1 = "A26JUYT14N57PY";
$value1 = "Aleksander&#39;s Kindle Cloud Reader";

$attr2 = "A13400OMTGFDRH";
$value2 = "Aleksander&#39;s Kindle for PC";

$attr3 = "add-new";
$value3 = "Register a new Kindle";

But when I upload the script to the server, it no longer works (I am not sure which PHP version the server runs, but it is older than 5.3.0). The results are:

$attr1 = "A26JUYT14N57PY";
$value1 = "'";

$attr2 = "A13400OMTGFDRH";
$value2 = "'";

$attr3 = "add-new";
$value3 = "";

So only the apostrophes are left of the strings in the nodeValues. I think it has something to do with encoding, but I am not sure. The strange thing is that only the nodeValues are wrong; the value attributes are fine.

-------------- edit

Here is the code that parses the web page (the source of the classes it uses is below). $page is the HTML source of the page as returned by cURL. I can't give you a direct URL because the page is only available after logging in to Amazon.

$dom = HtmlDomParser::getDomFromHtml($page);
$form = FormDomParser::getFormByName($dom,$this->amazon_config->buy_form_name);

if($form===false)
{
    throw new AmazonParseException("Couldn't parse buy form");
}

$select = FormDomParser::getSelectByName($dom,$this->amazon_config->buy_deliveryoptions_name);
if($select === false)
{
    throw new AmazonParseException("Couldn't parse options select");
}

$options = FormDomParser::getOptions($select);

$result = array();
foreach($options as $option)
{
    //$value = $option->childNodes->item(0)->nodeValue;
    //print_r($value);

    $device_id = $option->getAttribute('value');
    $device_name = $option->nodeValue;

    // print one "id = name" pair per line
    echo $device_id.' = '.$device_name.'<br />';
}

HtmlDomParser

// simple class for parsing html files with DOM
    class HtmlDomParser
    {
        // converts html (as string) to DOM object
        public static function getDomFromHtml($html)
        {
            $dom = new DOMDocument;
            $dom->loadHTML($html);
            return $dom;
        }

        // gets all occurrences of specified tag from dom object
        // these tags must contain specified (in attributes array) attributes
        public static function getTagsByAttributes($dom,$tag,$attributes = array())
        {
            $result = array();
            $elements = $dom->getElementsByTagName($tag);

            foreach($elements as $element)
            {
                $attributes_ok = true;
                foreach($attributes as $key => $value)
                {
                    if($element->getAttribute($key)!=$value)
                    {
                        $attributes_ok = false;
                        break;
                    }
                }

                if($attributes_ok)
                {
                    $result[] = $element;
                }
            }
            return $result;
        }
    }

FormDomParser

class FormDomParser
    {
        // gets form (as dom object) with specified name
        public static function getFormByName($dom,$form_name)
        {
            $attributes['name'] = $form_name;
            $forms = HtmlDomParser::getTagsByAttributes($dom,'form',$attributes);
            if(count($forms)<1)
            {
                return false;
            }
            else
            {
                return $forms[0];
            }
        }

        // gets all <input ...> tags from specified DOM object
        public static function getInputs($dom)
        {
            $inputs = HtmlDomParser::getTagsByAttributes($dom,'input');
            return $inputs;
        }

        // internal / converts array of Dom objects into associative array
        public static function convertInputsToArray($inputs)
        {
            $inputs_array = array();
            foreach($inputs as $input)
            {
                $name = $input->getAttribute('name');
                $value = $input->getAttribute('value');

                if($name!='')
                {
                    $inputs_array[$name] = $value;
                }
            }   
            return $inputs_array;
        }


        // gets all <select ...> tags from DOM object
        public static function getSelects($dom)
        {
            $selects = HtmlDomParser::getTagsByAttributes($dom,'select');
            return $selects;
        }

        // gets <select ...> tag with specified name from DOM object
        public static function getSelectByName($dom,$name)
        {
            $attributes['name'] = $name;
            $selects = HtmlDomParser::getTagsByAttributes($dom,'select',$attributes);
            if(count($selects)<1)
            {
                return false;
            }
            else
            {
                return $selects[0];
            }
        }

        // gets <option ...> tags from DOM object
        public static function getOptions($dom)
        {
            $options = HtmlDomParser::getTagsByAttributes($dom,'option');
            return $options;
        }

        // gets action value from form (as DOM object)
        public static function getAction($dom)
        {
            $action =  $dom->getAttribute('action');
            if($action == "")
            {
                return false;
            }
            else
            {
                return $action;
            }
        }
    }

--------- edit

Here is the HTTP header of the site I am trying to parse (as returned by curl):

HTTP/1.1 200 OK
Date: Fri, 11 May 2012 08:54:23 GMT
Server: Server
x-amz-id-1: 0CHN2KA4VD4FTXF7K62J
p3p: policyref="http://www.amazon.com/w3c/p3p.xml",CP="CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC "
x-frame-options: SAMEORIGIN
x-amz-id-2: fFWynUQG0oqudmoDO+2FEraC2H+wWl0p9RpOyGxwyXKOc9u/6f2v8ffWUFkaUKU6
Vary: Accept-Encoding,User-Agent
Content-Type: text/html; charset=ISO-8859-1
Set-cookie: ubid-main=190-8691333-9825146; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
Set-cookie: session-id-time=2082787201l; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
Set-cookie: session-id=187-8097468-1751521; path=/; domain=.amazon.com; expires=Tue, 01-Jan-2036 08:00:01 GMT
Transfer-Encoding: chunked

----------------------- edit

I just used http://simplehtmldom.sourceforge.net and it works great.

user606521

4 Answers

The problem must be the &#39; entity. DOM operates on XML documents, and you would need a CDATA section in order to have & characters in values.

Remove the &#39; and check whether it works. If it does, then you need CDATA.

Alexios Tsiaparas

Try getting the nodeValue of the text node itself:

$value = $option->firstChild->nodeValue;
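
Since firstChild is null for an element without children, reading ->nodeValue on it directly can fail. A minimal sketch of a guarded variant follows; the helper name directText and the sample markup are assumptions made for illustration, not part of the original code:

```php
<?php
// Hypothetical helper: concatenates the values of an element's direct
// text-node children and returns '' when the element has no children.
function directText(DOMNode $node)
{
    $text = '';
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $child) {
            if ($child->nodeType === XML_TEXT_NODE) {
                $text .= $child->nodeValue;
            }
        }
    }
    return $text;
}

// Sample markup (assumed): one option with text, one empty option.
$dom = new DOMDocument();
$dom->loadHTML(
    '<select>'
    . '<option value="add-new">Register a new Kindle</option>'
    . '<option value="empty"></option>'
    . '</select>'
);

$options = $dom->getElementsByTagName('option');
$first  = directText($options->item(0));
$second = directText($options->item(1));
```
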
Niet the Dark Absol

Here are a few things you can try. Try them all on your server, not on your own machine.

  1. Test with both the $dom->loadXML() and $dom->loadHTML() methods.
  2. Check Kolink's answer, or try $value = $option->childNodes->item(0)->nodeValue;
  3. Run $array = simplexml_load_string($dom->saveXML($option)) and see whether the result has what you need.
  4. Try $dom->loadHTML(html_entity_decode($html));
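
Item 4 could look roughly like the sketch below. The sample markup is taken from the question; passing ENT_QUOTES and 'UTF-8' explicitly is an assumption, made so that single-quote entities such as &#39; are decoded too. Note that decoding the whole page before parsing also turns entities like &lt; into real markup characters, so this is only a diagnostic step, not a general fix.

```php
<?php
// Sample markup from the question, wrapped in a <select> for the parser.
$html = '<select>'
      . '<option value="A26JUYT14N57PY">Aleksander&#39;s Kindle Cloud Reader</option>'
      . '</select>';

// Decode entities up front so the parser only sees literal characters.
// ENT_QUOTES makes sure single-quote entities are decoded as well.
$decoded = html_entity_decode($html, ENT_QUOTES, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($decoded);

$option      = $dom->getElementsByTagName('option')->item(0);
$device_id   = $option->getAttribute('value');
$device_name = $option->nodeValue;
```
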
thevikas
  • 1. Doesn't work - DOM crashes when I try to load it as XML. 2. Same error - Trying to get property of non-object. 3. saveXML(..) returns this: `` – user606521 May 11 '12 at 08:18

I'd guess it is a difference in configuration rather than in the PHP version. DOMDocument sometimes substitutes entities and sometimes does not. This has also been discussed, in the context of attacks on the DOMDocument component, in "How can I use PHP's various XML libraries to get DOM-like functionality and avoid DoS vulnerabilities, like Billion Laughs or Quadratic Blowup?".

An interesting configuration setting is LIBXML_NOENT:

$doc->loadXML($src, LIBXML_NOENT);

You have not shared any code, so I don't know if this applies to yours.
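
To illustrate what the flag changes, here is a small sketch on a made-up XML fragment. The entity name device, the sample markup and the helper countEntityRefs are assumptions for illustration only. With LIBXML_NOENT the parser replaces entity references with their text while loading; that same substitution is what the attacks mentioned above exploit, so it should only be enabled for trusted input.

```php
<?php
// Hypothetical sample: XML with an internal entity named "device".
$src = '<!DOCTYPE select [<!ENTITY device "Kindle for PC">]>'
     . '<select><option value="A13400OMTGFDRH">Aleksander\'s &device;</option></select>';

// Hypothetical helper: counts entity-reference nodes among direct children.
function countEntityRefs(DOMNode $node)
{
    $count = 0;
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ENTITY_REF_NODE) {
            $count++;
        }
    }
    return $count;
}

// Parse with entity substitution enabled.
$doc = new DOMDocument();
$doc->loadXML($src, LIBXML_NOENT);

$option = $doc->getElementsByTagName('option')->item(0);
$refs   = countEntityRefs($option); // no reference nodes remain after substitution
$text   = $option->nodeValue;
```

Running the same code without the second argument to loadXML() and comparing countEntityRefs() is one way to check how a given server handles entities.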

Another thing you should take a look at (as I've experienced that), is the encoding of the document. It might help to re-save the document / convert it to UTF-8 properly. If entities can be substituted when saving the HTML, they normally are.
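
A minimal sketch of such a conversion, assuming the mbstring extension is available and that the source charset is known (the response header in the question says ISO-8859-1). The function name loadHtmlAsUtf8 and the sample page are hypothetical. The prepended meta tag is a common way to tell the HTML parser which encoding the input uses.

```php
<?php
// Hypothetical helper: converts the page to UTF-8 and declares that
// encoding to the HTML parser through a leading meta tag.
function loadHtmlAsUtf8($html, $sourceCharset)
{
    $utf8 = mb_convert_encoding($html, 'UTF-8', $sourceCharset);

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML(
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $utf8
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $dom;
}

// Sample input (assumed): "\xE9" is "é" encoded as ISO-8859-1.
$page = "<select><option value=\"A1\">Caf\xE9 Kindle</option></select>";

$dom  = loadHtmlAsUtf8($page, 'ISO-8859-1');
$name = $dom->getElementsByTagName('option')->item(0)->nodeValue;
```
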

The third option is to write some code yourself that substitutes the entity nodes with text nodes, and then to normalize the document again so that adjacent text nodes are combined.

hakre
  • I tried `$page = utf8_encode($page); $dom = HtmlDomParser::getDomFromHtml($page);` but the result is the same (HtmlDomParser just uses `->loadHtml(..)`) – user606521 May 11 '12 at 08:32
  • Okay, you are not using PHP's `DOMDocument` here but some other library. Which one? Also, please provide the URL of the page, otherwise this is really awkward to reproduce. – hakre May 11 '12 at 08:33
  • I edited my comment. It's DOM, just wrapped in a class; I will post all my code soon. I can't give you the URL because it's behind the login procedure for my account :(. – user606521 May 11 '12 at 08:35
  • Is there some similar page w/o login or just some page on the same server w/o login? That would allow to check encodings, which looks like the important part here. – hakre May 11 '12 at 08:36
  • Might help, too. Have you seen the comment about the libxml version? And can you specify the PHP versions as well? – hakre May 11 '12 at 08:42
  • Also, please try if `$option->textContent;` instead of `nodeValue` does the job for you. Might be a workaround. – hakre May 11 '12 at 09:07
  • textContent returns the same: an apostrophe for each of the first two options, and nothing for the third. I uploaded the source along with the HTTP header returned by curl - http://depositfiles.com/files/k2a6qmbai - sorry, I couldn't find a better upload site, you have to wait 60 secs. It's one file, source.html.txt – user606521 May 11 '12 at 09:18
  • PHP version on server: 5.2.17 – user606521 May 11 '12 at 09:20