
Using XPath to webscrape.

The structure is:

<table>
  <tbody>
     <tr>
        <th>
        <td>

but one of those tr elements contains just one th or one td:

<table>
      <tbody>
         <tr>
            <th>

So I only want to scrape a TR if it contains two tags. I am using the path

 $route = $path->query("//table[count(tr) > 1]//tr/th");

or

 $route = $path->query("//table[count(tr) > 1]//tr/td");

But it's not working.

Here is the link to the original tables. The last two TRs of the first table have just one TD each, and that is causing the problem. The 2nd and 3rd tables have the same issue as well.

https://www.daiwahouse.co.jp/mansion/kanto/tokyo/y35/gaiyo.html

      $route = $path->query("//tr[count(*) >= 2]/th");
      foreach ($route as $th){
          $property[] = trim($th->nodeValue);
      }

      $route = $path->query("//tr[count(*) >= 2]/td");
      foreach ($route as $td){
          $value[] = trim($td->nodeValue);
      }

I am trying to select TH and TD at the same time, but a TR that contains only one TD causes a problem: in the end the TD count and the TH count are not the same, because I am scraping more TDs than THs.

3 Answers


This XPath,

//table[count(.//tr) > 1]//th

will select all th elements within all table elements that have more than one tr descendant (regardless of whether tbody is present).


This XPath,

//tr[count(*) > 1]/*

will select all children of tr elements with more than one child.


This XPath,

//tr[count(th) = count(td)]/*

will select all children of tr elements where the number of th children equals the number of td children.
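
As a rough sketch of how this third expression could be used from PHP (the sample markup and the variable names `$properties`, `$values` and `$pairs` are made up for illustration and are not taken from the linked page), applying the same predicate in both queries keeps the two arrays the same length:

```php
<?php
// Hypothetical sample: two rows with a th/td pair, one row with only a td.
$html = '<table><tbody>
  <tr><th>Name</th><td>Alpha</td></tr>
  <tr><th>Floors</th><td>12</td></tr>
  <tr><td>Footnote without a header cell</td></tr>
</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Same row predicate in both queries, so unbalanced rows are skipped twice.
$properties = [];
foreach ($xpath->query('//tr[count(th) = count(td)]/th') as $th) {
    $properties[] = trim($th->nodeValue);
}

$values = [];
foreach ($xpath->query('//tr[count(th) = count(td)]/td') as $td) {
    $values[] = trim($td->nodeValue);
}

// Both arrays now have one entry per balanced row.
$pairs = array_combine($properties, $values);
print_r($pairs);
```

The row that holds only a td is filtered out by the predicate in both queries, so the header and value lists stay aligned.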


OP posted a link to the site. The root element is in the xmlns="http://www.w3.org/1999/xhtml" namespace.

See How does XPath deal with XML namespaces?
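
Whether the namespace matters depends on how the page is parsed. As far as I know, `DOMDocument::loadHTML()` uses libxml's HTML parser, which leaves elements outside the XHTML namespace, so unprefixed names match there; if the page is loaded as XML with `loadXML()` instead, a prefix has to be registered. A minimal sketch, assuming a made-up XHTML snippet and an arbitrary prefix `x`:

```php
<?php
// Hypothetical XHTML snippet; the default namespace applies to every element.
$xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>
  <tr><th>Name</th><td>Alpha</td></tr>
  <tr><td>Footnote only</td></tr>
</table></body></html>';

$doc = new DOMDocument();
$doc->loadXML($xhtml); // parsed as XML, so the namespace is preserved

$xpath = new DOMXPath($doc);
// The prefix "x" is an arbitrary choice; only the URI has to match.
$xpath->registerNamespace('x', 'http://www.w3.org/1999/xhtml');

// Unprefixed names look for elements in no namespace and find nothing here.
$unprefixed = $xpath->query('//tr[count(th) = count(td)]/*');

// Prefixed names match the namespaced elements.
$prefixed = $xpath->query('//x:tr[count(x:th) = count(x:td)]/x:*');

echo $unprefixed->length, "\n";
echo $prefixed->length, "\n";
```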

kjhughes
  • Close enough, but it is still picking up extras. In the end my property (th) count and my value (td) count are not the same. –  Jan 25 '19 at 03:46
  • If you could precisely state what you're trying to select, it'd be simple to write the XPath. For example, "I am trying to select ____ elements where the following conditions are met: ______. Note that I do not want those elements when ____ condition is met." I've thrown out a second guess of what you might want in the meantime. – kjhughes Jan 25 '19 at 03:51
  • I updated the question and added a link to the original tables. Could you please check it out? –  Jan 25 '19 at 03:55
  • Your update still has not said what elements you wish to select (`td`, `th`, either, `tr`, `table`, etc), and you've not clearly specified the conditions distinguishing which such element you want from which you do not want. – kjhughes Jan 25 '19 at 03:56
  • I am trying to select TH and TD at the same time, but a TR that contains only one TD causes a problem: in the end the TD count and the TH count are not the same, because I am scraping more TDs than THs. –  Jan 25 '19 at 03:58
  • Updated with XPath that'll only select `td` and `th` elements when there's an equal number of them in a `tr`. Is that what you want? – kjhughes Jan 25 '19 at 04:04
  • Thank you so much, this nearly solved the problem. The opposite of `//tr[count(th) = count(td)]/*` is `//tr[count(td) = count(th)]/*`, because I have two queries: one for the TH values and the other for the TD values, as in the question update. –  Jan 25 '19 at 04:12
  • What? `count(th) = count(td)` is equivalent to, not the "opposite of" `count(td) = count(th)`. Sorry, but this is an excessive amount of back-and-forth. I'm moving on. Good luck. – kjhughes Jan 25 '19 at 04:16
  • Alright, I'm glad you solved it. I'm not sure how I helped, but hopefully I did somehow. :-) – kjhughes Jan 25 '19 at 04:18
  • Haha, you really did! :) –  Jan 25 '19 at 04:19

If I understand correctly, you want th elements in trs that contain two elements? I think that this is what you need:

//th[count(../*) = 2]
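
For what it's worth, here is a small sketch of this expression in PHP, using made-up markup. XPath 1.0 compares with a single `=`; `==` is a syntax error there, and `DOMXPath::query()` returns `false` for a malformed expression, so it is worth checking the result before looping:

```php
<?php
// Hypothetical sample: one row with two cells, one row with a single th.
$html = '<table>
  <tr><th>Name</th><td>Alpha</td></tr>
  <tr><th>Lonely header</th></tr>
</table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Keep only th elements whose parent row has exactly two child elements.
$headers = [];
$nodes = $xpath->query('//th[count(../*) = 2]');
if ($nodes !== false) {
    foreach ($nodes as $th) {
        $headers[] = trim($th->nodeValue);
    }
}
print_r($headers);

// "==" is not an XPath operator; @ hides the warning so only the
// return value is shown.
$bad = @$xpath->query('//th[count(../*) == 2]');
var_dump($bad);
```

Passing a `false` result straight to `foreach` is what produces an "invalid argument" style error, so the explicit check keeps the failure visible at the query rather than at the loop.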
zneak
  • Okay, I tried it like this: `//th[count(../*) == 2]`, but this error pops up: "Invalid argument supplied for foreach()". What is the "*" for? –  Jan 25 '19 at 03:45
  • To explain by example, `count(tr)` counts the `tr` children of the context node (not the number of elements under a `tr`). `count(*)` counts all element children of the context node. `count(../*)` counts the element children of the parent, that is, the current node together with its siblings. – zneak Jan 25 '19 at 03:47
  • I updated the question and added a link to the original tables. Could you please check it out? –  Jan 25 '19 at 03:55

I've included a more explicit path in my answer, using a union (`|`) to count both TH and TD elements:

$html = '
  <html>
    <body>
      <table>
        <tbody>
          <tr>
            <th>I am Included</th>
            <td>I am a column</td>
          </tr>
        </tbody>
      </table>
      <table>
        <tbody>
          <tr>
            <th>I am ignored</th>
          </tr>
        </tbody>
      </table>
      <table>
        <tbody>
          <tr>
            <th>I am also Included</th>
            <td>I am a column</td>
          </tr>
        </tbody>
      </table>
    </body>
  </html>
';

$doc = new DOMDocument();
$doc->loadHTML( $html );

$xpath = new DOMXPath( $doc );
$result = $xpath->query("//table[ count( tbody/tr/td | tbody/tr/th ) > 1 ]/tbody/tr");

foreach( $result as $node )
{
  var_dump( $doc->saveHTML( $node ) );
}

// string(88) "<tr><th>I am Included</th><td>I am a column</td></tr>"
// string(93) "<tr><th>I am also Included</th><td>I am a column</td></tr>"

You can also use this to match descendants at any depth:

//table[ count( descendant::td | descendant::th ) > 1]//tr

Change the part of the XPath after the predicate (the square-bracketed condition) to change which nodes are returned.
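
Another way to keep headers and values aligned, sketched here with made-up markup and hypothetical variable names (`$rows`, `$data`), is to select the qualifying rows once and then read the `th` and `td` of each row relative to that row, using the second (context node) argument of `DOMXPath::query()`:

```php
<?php
// Hypothetical sample: two complete rows and one row with only a td.
$html = '<table><tbody>
  <tr><th>Name</th><td>Alpha</td></tr>
  <tr><td>Note without a header</td></tr>
  <tr><th>Floors</th><td>12</td></tr>
</tbody></table>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Select only rows that hold exactly one th and one td.
$rows = $xpath->query('//tr[count(th) = 1 and count(td) = 1]');

$data = [];
foreach ($rows as $tr) {
    // Relative paths are evaluated against the current row.
    $th = $xpath->query('th', $tr)->item(0);
    $td = $xpath->query('td', $tr)->item(0);
    $data[trim($th->nodeValue)] = trim($td->nodeValue);
}
print_r($data);
```

Because each header is read from the same row as its value, the two can never drift out of step, at the cost of one extra query per row.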

Scuzzy
  • I updated the question and added a link to the original tables. Could you please check it out? –  Jan 25 '19 at 03:54