The HTML snippet below is served at a URL (www.foo.com/index.html):

...
<th class="name" align="left" scope="col">
<a class="foo" href="foo.html">foo</a>
</th>
...
<th class="name" align="left" scope="col">
<a class="bar" href="bar.html">bar</a>
</th>
...
<th class="name" align="left" scope="col">
<a class="ba" href="baz.html">baz</a>
</th>
......

I would like to use PHP to get all the text inside the elements with class .name and convert it to JSON, so that it ends up like:

{"names":["foo","bar","baz"]}

This is what I have tried:

function linkExtractor($html){
    $nameArr = array();
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $names = //how do i get the elements?
    foreach($names as $name) {
        array_push($nameArr, $name);
    }
    return $nameArr;
}

echo json_encode(array("names" => linkExtractor($html)));
maxisme

2 Answers

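Try this:

// loadHTML() expects markup, not a URL, so fetch the page first
$html = file_get_contents("http://www.foo.com/index.html");

function linkExtractor($html, $classname){
    $nameArr = array();
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // keep malformed HTML from emitting warnings
    $doc->loadHTML($html);

    // DOMDocument has no xpath() or query() method; queries go through DOMXPath
    $xpath = new DOMXPath($doc);
    $names = $xpath->query("//*[@class='" . $classname . "']");

    foreach($names as $name) {
        // collect the text of each node, not the node object itself
        array_push($nameArr, trim($name->textContent));
    }
    return $nameArr;
}

// pass the bare class name; ".name" is CSS selector syntax, not an attribute value
echo json_encode(array("names" => linkExtractor($html, "name")));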
Anri
  • and before you try this, rest assured that it won't work. – hakre May 08 '14 at 12:49
  • I am getting the error `Missing argument 2 for linkExtractor()` – maxisme May 08 '14 at 12:49
  • use edited version of answer ... – Anri May 08 '14 at 12:50
  • @Maximilian: That error only prevented you from getting the next fatal error. See the linked duplicate on how to actually run that xpath query. – hakre May 08 '14 at 12:53
  • Why does this not work? It seems like it should. – maxisme May 08 '14 at 12:53
  • @Anri still no luck with the update? – maxisme May 08 '14 at 12:58
  • Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 16 in /home/content/57/9770557/html/untitled folder/jsonnames.php on line 8 Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity, line: 18 in /home/content/57/9770557/html/untitled folder/jsonnames.php on line 8 – maxisme May 08 '14 at 13:01
  • Can you try again after putting a ; at the end of this line: **$html = "http://www.foo.com/index.html"** – Anri May 08 '14 at 13:03
  • Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: expecting ';' in Entity, line: 16 in /home/content/57/9770557/html/untitled folder/jsonnames.php on line 8 Warning: DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity, line: 18 in /home/content/57/9770557/html/untitled folder/jsonnames.php on line 8 – maxisme May 08 '14 at 13:07
  • @Anri this is my new error above! – maxisme May 08 '14 at 13:08
  • Take a look at **http://stackoverflow.com/questions/12328322/php-domdocumentloadhtml-domdocument-loadhtml-htmlparseentityref-no-name** and **http://stackoverflow.com/questions/1685277/warning-domdocumentloadhtml-htmlparseentityref-expecting-in-entity**. I think you have some other errors as well. – Anri May 08 '14 at 13:14
  • Okay, now I am just getting this error: – maxisme May 08 '14 at 13:17
  • @Anri Fatal error: Call to undefined method DOMDocument::query() in /home/content/57/9770557/html/untitled folder/jsonnames.php on line 8 – maxisme May 08 '14 at 13:18
  • instead of query try xpath see updated version ... – Anri May 08 '14 at 13:21
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/52308/discussion-between-maximilian-and-anri) – maxisme May 08 '14 at 13:24
Just so this comes to an end, here is a working version:

$names = function($html) {
    $doc  = new DOMDocument();
    $last = libxml_use_internal_errors(TRUE);
    $doc->loadHTML($html);
    libxml_use_internal_errors($last);
    $xp     = new DOMXPath($doc);
    $result = array();
    foreach ($xp->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' name ')]") as $node)
        $result[trim($node->textContent)] = 1;
    return array_keys($result);
};

echo json_encode(array("names" => $names($html)));

Output:

{"names":["foo","bar","baz"]}

Required PHP version: 5.3+
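
A note on the XPath expression above: it treats `name` as one whitespace-separated token of the class attribute, so elements carrying several classes still match, while lookalike classes such as `surname` do not. A minimal sketch comparing it with an exact `@class='name'` test follows. The `$sample` markup and the `$texts` helper are made up for illustration and are not part of the question.

```php
<?php
// Hypothetical sample markup: one plain class, one multi-class, one lookalike.
$sample = '<table><tr>'
        . '<th class="name"><a href="foo.html">foo</a></th>'
        . '<th class="name wide"><a href="bar.html">bar</a></th>'
        . '<th class="surname"><a href="baz.html">baz</a></th>'
        . '</tr></table>';

$doc  = new DOMDocument();
$last = libxml_use_internal_errors(true);
$doc->loadHTML($sample);
libxml_use_internal_errors($last);
$xp = new DOMXPath($doc);

// Illustrative helper: collect the trimmed text of every node a query matches.
$texts = function ($query) use ($xp) {
    $out = array();
    foreach ($xp->query($query) as $node) {
        $out[] = trim($node->textContent);
    }
    return $out;
};

// Exact comparison only matches when the attribute is literally "name".
$exact = $texts("//*[@class='name']");
// Token comparison matches "name" anywhere in a space-separated class list.
$token = $texts("//*[contains(concat(' ', normalize-space(@class), ' '), ' name ')]");

echo json_encode(array("exact" => $exact, "token" => $token)), "\n";
// prints {"exact":["foo"],"token":["foo","bar"]}
```

If the real page uses more than one class on its header cells, the exact comparison would silently return fewer names, which is why the token form is the safer default.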

hakre
  • this returns nothing. – maxisme May 08 '14 at 16:29
  • like this `{"names":[]}` – maxisme May 08 '14 at 16:30
  • If you see that output, it means the code generally works but the HTML is not what you wrote in your question. As you can see, it works perfectly: http://3v4l.org/3TUPb - So if you provide HTML that does not contain such elements (e.g. because it is plainly invalid and DOM refuses to load it), fix the HTML first. You probably just have an HTML problem that is totally unrelated to traversing the nodes. – hakre May 08 '14 at 19:50
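
To check whether the HTML itself is the problem, the parser messages that `libxml_use_internal_errors(true)` collects can be inspected instead of discarded. This is a minimal sketch, assuming a deliberately sloppy sample string: `$badHtml` is made up for illustration, and the real markup would come from wherever `$html` is loaded. Which messages appear, if any, depends on the installed libxml version.

```php
<?php
// Hypothetical input with bare "&" characters, the kind of markup that
// can produce the htmlParseEntityRef warnings quoted in the comments.
$badHtml = '<th class="name"><a href="foo.html?a=1&b=2">foo & co</a></th>';

$doc    = new DOMDocument();
$last   = libxml_use_internal_errors(true);  // collect errors instead of emitting warnings
$loaded = $doc->loadHTML($badHtml);

$messages = array();
foreach (libxml_get_errors() as $error) {     // each item is a LibXMLError object
    $messages[] = sprintf("line %d: %s", $error->line, trim($error->message));
}
libxml_clear_errors();                        // empty the buffer for later parses
libxml_use_internal_errors($last);            // restore the previous setting

var_dump($loaded);   // whether the parser produced a document at all
print_r($messages);  // what it complained about, if anything
```

An empty result from the XPath query combined with a long list of messages here would point at the markup rather than at the node traversal.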