2

I have a string of HTML and I need to check whether the href attributes of any anchors match a certain link pattern. If they do, I need to modify them.

Here's a sample HTML string:

<p>Disculpa, pero esta entrada está disponible sólo en <a href="http://www.example.com/static/?json=get_page&amp;post_type=page&amp;slug=sample-page&amp;lang=ru">Pусский</a> y <a href="http://www.example.com/static/?json=get_page&amp;post_type=page&amp;sample-page&amp;lang=en">English</a>.</p>

So the URLs in question follow this pattern:

http://www.example.com/static/?json=get_page&post_type=page&slug=sample-page&lang=ru

where the value of the lang query parameter varies.

If an href matching that pattern is found, I need to change it to:

http://www.example.com/ru/sample-page

So I need to remove 'static' and replace it with the value of the lang attribute, and I need to append the value of the 'slug' attribute to the end of the URL.

Sadly I'm getting confounded at the first step so I haven't even been able to test out methods of parsing the URLs and replacing them with the new value.

$html = '<p>Disculpa, pero esta entrada está disponible sólo en <a href="http://www.example.com/static/?json=get_page&amp;post_type=page&amp;slug=sample-page&amp;lang=ru">Pусский</a> y <a href="http://www.example.com/static/?json=get_page&amp;post_type=page&amp;sample-page&amp;lang=en">English</a>.</p>';
$dom = new DOMDocument;
// The UTF-8 encoding is necessary
$dom->loadHTML(mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8'));
$anchors = $dom->getElementsByTagName('a');

In theory from this point on I'd loop through the anchors found and do stuff, but if I var_dump the $anchors variable I just get:

object(DOMNodeList)#66 (0) { }

So I can't even proceed further!

Any idea what's causing the DOM to fail to collect the anchors?

After that, any suggestions on how best to identify whether an anchor contains the URL pattern, change it, and return the modified HTML?

Update 1

So it turns out that there's a PHP bug before 5.4.1 which prevents var_dump from displaying the contents of the DOMNodeList. I can find values with:

foreach ($anchors as $anchor) {
    // Use a separate loop variable so the node list is not overwritten
    echo $anchor->nodeValue, PHP_EOL;
}

However, I have no idea what the $anchors object really looks like, so I am running blind. If anyone has suggestions on how to parse the anchors and modify them as described above, that would be hugely appreciated (whilst I try to sort out a PHP 5.4.1 instance).

alexleonard
  • try doing `count($anchors)` see how much you get – DevZer0 Jul 16 '13 at 04:31
  • Oddly, count($anchors) returns 1, but the var_dump is: object(DOMNodeList)#66 (0) { } – alexleonard Jul 16 '13 at 04:38
  • I've updated the question to show the $html variable and var_dump($anchors) instead of print_r – alexleonard Jul 16 '13 at 04:40
  • works fine at my end, i get `DOMNodeList Object ( [length] => 2 )`. Did you check passing html without calling `mb_convert_encoding`. my guess is file encoding or data encoding is creating some problem for you. – user1402647 Jul 16 '13 at 04:42
  • That's super weird. I have absolutely no idea why it's not working. Hmm. I've dropped the encoding and still get var_dump($anchors) = object(DOMNodeList)#66 (0) { } – alexleonard Jul 16 '13 at 04:46
  • i get `DOMNodeList Object ( [length] => 2 ) ` – DevZer0 Jul 16 '13 at 04:54
  • whats the version of your PHP and DomDocument – DevZer0 Jul 16 '13 at 04:55
  • Right, I've discovered the issue thanks to a friend's explanation. I can't var_dump the DOM object. I have to do foreach ($anchors as $anchors) { echo $anchors->nodeValue, PHP_EOL; } and then it gives values. Which is weird. I've tested this on our server running php5.2 and locally on xampp on php5.3.8 – alexleonard Jul 16 '13 at 05:04
  • http://stackoverflow.com/questions/4776093/why-doesnt-var-dump-work-with-domdocument-objects-while-printdom-savehtml - that's what was throwing me on that one...! – alexleonard Jul 16 '13 at 05:08

5 Answers

6

I did something similar not long ago. You can iterate over a DOMNodeList and read the href attribute of each anchor.

$dom = new DOMDocument;
$dom->loadHTML($content);
foreach ($dom->getElementsByTagName('a') as $node) {
    $original_url = $node->getAttribute('href');
    // Do something here to compute the replacement URL
    $new_url = $original_url;
    $node->setAttribute('href', $new_url);
}
$html = $dom->saveHTML();
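
As a rough sketch of what the "do something here" step could look like for the URLs in the question: the helper below, rewriteStaticUrl, is a hypothetical name and not part of any library. It assumes the path is exactly /static/, that json=get_page marks the links to rewrite, and that both lang and slug must be present. Note that getAttribute('href') returns the decoded value, so the &amp; entities already appear as plain & by the time the URL is parsed.

```php
<?php
// Hypothetical helper: returns the rewritten URL, or null when $href does
// not match the /static/?json=get_page... pattern described in the question.
function rewriteStaticUrl($href)
{
    $parts = parse_url($href);
    if ($parts === false
        || !isset($parts['scheme'], $parts['host'], $parts['path'], $parts['query'])) {
        return null;
    }
    // Assumption: only links under exactly /static/ are rewritten
    if ($parts['path'] !== '/static/') {
        return null;
    }
    // Split the query string into an associative array
    parse_str($parts['query'], $query);
    if (!isset($query['json'], $query['lang'], $query['slug'])
        || $query['json'] !== 'get_page') {
        return null;
    }
    // Build scheme://host/<lang>/<slug>
    return $parts['scheme'] . '://' . $parts['host'] . '/'
        . rawurlencode($query['lang']) . '/' . rawurlencode($query['slug']);
}
```

Inside the loop you would call it on the href and only call setAttribute when the result is not null. With the sample HTML from the question, the second anchor has no slug= key in its query string, so under these assumptions it would be left unchanged.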
Hayden
0

Maybe try echoing the HTML first? You might be passing an empty string or something.

Wassim Gr
0

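Try this; it will give you the href value:

$anchors = $dom->getElementsByTagName('a');
// getNamedItem() returns a DOMAttr node, so echo its value rather than the object
echo $anchors->item(0)->attributes->getNamedItem('href')->nodeValue;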
DevZer0
0
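function getLinks($link)
{
    $ret = array();

    $dom = new DOMDocument;
    // Set parser options before loading the document
    $dom->preserveWhiteSpace = false;

    // @ suppresses the warnings libxml raises on malformed HTML
    @$dom->loadHTML(file_get_contents($link));

    // Map each href to the value of the anchor's first child node
    foreach ($dom->getElementsByTagName('a') as $tag) {
        // @ suppresses the notice for anchors that have no child nodes
        @$ret[$tag->getAttribute('href')] = $tag->childNodes->item(0)->nodeValue;
    }

    return $ret;
}

$link = "http://php.net";
$url = getLinks($link);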
Tobia Zambon
0

I agree with Hayden's answer, but I want to make the solution more robust, because we sometimes run into encoding issues when manipulating a DOM document. Here is a version that converts the input first:

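$dom = new DOMDocument;
// Convert non-ASCII characters to entities so loadHTML does not mangle UTF-8 text
$dom->loadHTML(mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8'));
foreach ($dom->getElementsByTagName('a') as $node) {
    $original_url = $node->getAttribute('href');
    // Do something here to compute the replacement URL
    $new_url = $original_url;
    $node->setAttribute('href', $new_url);
}
$html = $dom->saveHTML();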
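
One thing to be aware of when returning the modified HTML: called with no argument, saveHTML() serialises the whole document, including the doctype and the html/body wrapper that loadHTML() adds around a fragment. A possible way to get back just the fragment is sketched below. innerBodyHtml is a hypothetical helper name, and passing a node to saveHTML() needs PHP 5.3.6 or later, so this would not work on a PHP 5.2 server.

```php
<?php
// Hypothetical helper: serialises only the children of <body>, so the
// wrapper elements added by loadHTML() are left out of the result.
function innerBodyHtml(DOMDocument $dom)
{
    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }
    $out = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serialises a single node (PHP 5.3.6+)
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

// Usage: load a fragment, modify it as needed, then serialise it
$dom = new DOMDocument;
$dom->loadHTML('<p>Hello <a href="http://www.example.com/">link</a></p>');
$fragment = innerBodyHtml($dom);
```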
Hasanuzzaman Sattar