
I'm building a script that gives me a product array by parsing the HTML from a list of websites.

I believe I'm doing everything right, but for some reason I'm having a lot of difficulty with one website only: Makita.ca.

I'm using DOMXPath to retrieve the elements, and I am giving it the raw HTML that I get from makita.ca.

The pictures I want are the ones on the left side of the page.

Please also note that the only thing I need is the link to each image, not the actual image.

The page in question is http://www.makita.ca/index2.php?event=tool&id=100

    $productArray = array();
    $Dom = new DOMDocument();
    @$Dom->loadHTML($this->html);
    $xpath = new DOMXPath($Dom);
    echo $xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table/tbody/tr[1]/td/div/a/img')->length;
    if ($xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table')->length > 0)
    {
        for ($i = 0; $i < $xpath->query('//*[@id="content_other"]/table[2]/tbody/tr/td[1]/table/tbody/tr[4]/td/table/tbody/tr')->length; $i++)
        {
            if ($xpath->query('//*[@id="content_other"]/table[2]/tr/td[1]/table/tr[4]/td/table/tr['.$i.']/td/div/a/img') > 0)
                $productArray['picture'][] = $xpath->query('//*[@id="content_other"]/table[2]/tr/td[1]/table/tr[4]/td/table/tr['.$i.']/td/div/a/img')->item(0)->nodeValue;
        }
    }

Do you see what my mistake is? I'm really lost now.

Edit:

OK, for test purposes I am echoing the length of the query() result, which should tell me how many elements match the query.

I retyped the whole query by hand so that it cannot contain any non-ASCII characters: '//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div/a/img'. The result is still 0.

So I removed the end of the query part by part:

//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div‌​/a = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div‌​ = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1] = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td = 0
//*[@id="content_other"]/table[2]//tr/td[1]/table//tr = 5

Finally, some elements match here! Now let's try the last row, which is the one I need. Since I assume the index is zero-based, to get row number 5 I need to enter this path:

//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]

But I still get 0, so I don't know what to do any more.

Jens Erat
Nicolas Racine
  • This is an exact duplicate, removing all the `/tbody` steps is all you need to do. Refer to given reference for details. – Jens Erat Jan 15 '14 at 20:36
  • @JensErat Hey, thanks guys, but I removed the tbody and I still can't get it working. To start, I just edited the echo to `echo $xpath -> query('//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div/a/img')->length;` and it echoes 0. – Nicolas Racine Jan 15 '14 at 21:10
  • I'm not sure what's messed up, but one of the characters of the last four axis steps is non-ascii and breaks the query. Try `//*[@id="content_other"]/table[2]//tr/td[1]/table//tr[4]/td/table//tr[1]/td/div/a/img` (I retyped those four steps). – Jens Erat Jan 15 '14 at 21:27
  • @JensErat I'm getting something very weird now. Let me edit my original post; take a look. – Nicolas Racine Jan 15 '14 at 21:48
  • try with `//a[@rel="thumbnail"]/@href` instead of the direct path and/or check the markup returned in DOM instead of in a browser. The markup in the browser likely contains implied markup as explained in http://stackoverflow.com/questions/5689011/dom-and-xpath-scraping-what-wrong-here/5689495#5689495 – Gordon Jan 15 '14 at 21:56
  • XPath is 1-based. The XPath expression I posted works for me. Did you try copy&past'ing it? The expressions you posted are broken for me again. @Gordon: I tried the expression with BaseX (not evaluating any JS, ...) and it works fine (after `/tbody` axis steps are removed). – Jens Erat Jan 15 '14 at 21:57
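
The three points raised in the comments (no implied `/tbody`, 1-based positions, invisible characters in a pasted query) can be checked in isolation. The following is only a minimal sketch: the markup is a made-up stand-in for the real page, the variable names are hypothetical, and it assumes PHP 7 or later with the DOM extension, where `DOMDocument::loadHTML` parses with libxml2 and does not add the `<tbody>` elements a browser inspector shows.

```php
<?php
// Hypothetical markup standing in for the real page. Note that the source has no <tbody>.
$html = '<div id="content_other"><table>'
      . '<tr><td>first</td></tr><tr><td>second</td></tr>'
      . '</table></div>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// 1. A path copied from a browser contains /tbody, which is absent from the parsed tree.
$withTbody    = $xpath->query('//*[@id="content_other"]/table/tbody/tr')->length;
$withoutTbody = $xpath->query('//*[@id="content_other"]/table/tr')->length;

// 2. XPath positions start at 1, so tr[1] is the first row and tr[0] never matches.
$firstRow = $xpath->query('//*[@id="content_other"]/table/tr[1]/td')->item(0)->textContent;
$rowZero  = $xpath->query('//*[@id="content_other"]/table/tr[0]')->length;

// 3. A pasted query may carry invisible characters (here U+200C and U+200B).
$pasted      = "//td/div\u{200C}\u{200B}/a";
$hasNonAscii = preg_match('/[^\x20-\x7E]/', $pasted) === 1;
$cleaned     = preg_replace('/[^\x20-\x7E]/', '', $pasted); // drop every non-printable-ASCII byte

echo $withTbody, ' ', $withoutTbody, ' ', $firstRow, ' ', $rowZero, PHP_EOL;
echo $hasNonAscii ? 'query contains non-ASCII bytes' : 'query is plain ASCII', PHP_EOL;
```

Running the same byte check on a query string before passing it to `query()` is a cheap way to rule out the copy-and-paste problem described above.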

1 Answer


`//div[@class='product_heading']/ancestor-or-self::table[1]//a/img` first locates the "Action Shots" heading, then selects all the images found inside that block.

This XPath expression will be more reliable than yours because it uses far fewer positional steps, which tend to break easily as the markup changes.

`//div[@class='product_heading']/ancestor-or-self::table[1]//a[@rel='thumbnail']/img` would be even more robust.
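
As a rough illustration of how that expression could be used to collect the image links, here is a small sketch. The HTML fragment, file names, and the `$pictures` variable are invented for the example and only imitate the structure the expression expects. Since an `<img>` element has no text content, the link is read from its `src` attribute rather than from `nodeValue`.

```php
<?php
// Hypothetical fragment imitating the "Action Shots" block; the real page markup may differ.
$html = '<table>'
      . '<tr><td><div class="product_heading">Action Shots</div></td></tr>'
      . '<tr><td><div><a rel="thumbnail" href="images/large1.jpg"><img src="images/thumb1.jpg"></a></div></td></tr>'
      . '<tr><td><div><a rel="thumbnail" href="images/large2.jpg"><img src="images/thumb2.jpg"></a></div></td></tr>'
      . '</table>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

// Find the heading, climb to its nearest enclosing table, then take every thumbnail image in it.
$query = "//div[@class='product_heading']/ancestor-or-self::table[1]//a[@rel='thumbnail']/img";

$pictures = [];
foreach ($xpath->query($query) as $img) {
    // The image link lives in the src attribute; nodeValue would be an empty string here.
    $pictures[] = $img->getAttribute('src');
}

print_r($pictures);
```

If the link to the full-size picture is wanted instead, ending the expression at `//a[@rel='thumbnail']/@href`, as suggested in the comments, should return the `href` attribute nodes.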

Grooveek