I want to replace the text of every anchor tag in a given block of HTML with the title of the page its href attribute points to.

That is, my HTML contains:

"This is <a href="https://www.example.com">www.example.com</a>"

I'd like to replace it with:

"This is <a href="https://www.example.com">Example Domain</a>"

Here's my PHP code:

$domDocument = new \DOMDocument();
$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR);
$domDocument->formatOutput = true;
$links = $domDocument->getElementsByTagName('a');

// Step 3: Iterate over each link
foreach ($links as $link) {
    // Step 4: Extract the href attribute from the link
    $href = $link->getAttribute('href');

    // Step 5: Using the extracted href, fetch the page title
    $title = $this->fetchPageTitle($href);

    // Step 8: Replace the existing anchor text with the page title
    $link->nodeValue = $title;
}

// Return the updated HTML only after every link has been processed
return $domDocument->saveHTML();

private function fetchPageTitle($url): string
{
    // Step 6: Fetch the contents of the page
    $page_html = Http::get($url)->body();

    // Step 7: Create a new DOMDocument object and extract the page title
    $pageDocument = new \DOMDocument();
    $pageDocument->loadHTML($page_html, LIBXML_NOERROR);
    $title = $pageDocument->getElementsByTagName('title')->item(0)->nodeValue;
    return $title;
}

While this code works, it garbles some of the non-ASCII characters in the text.

For example, in the text `Settings → Apple → Tag`, each arrow comes back as a run of unrelated characters (mojibake such as `â†’`),

and the apostrophe in `We’ll` is garbled in the same way (`Weâ€™ll`).

How do I fix this issue?
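
The kind of garbling described above can be reproduced in isolation. The following is a minimal sketch, assuming the text is UTF-8 and is being read as a single-byte encoding somewhere along the way. Windows-1252 is used purely as an illustration, and the variable names are made up for the example. Each multi-byte character then turns into several unrelated characters.

```php
<?php
// Illustration only: re-read the UTF-8 bytes of one character as if they were
// Windows-1252 (an assumed single-byte encoding) and inspect the result.
$arrow = "\u{2192}";                            // the arrow character, three bytes in UTF-8

$bytes  = bin2hex($arrow);                      // hex dump of the raw bytes
$isUtf8 = mb_check_encoding($arrow, 'UTF-8');   // is the string valid UTF-8?

// Treat each byte as one Windows-1252 character and convert that to UTF-8.
$garbled = mb_convert_encoding($arrow, 'UTF-8', 'Windows-1252');

echo $bytes, PHP_EOL;                           // e28692
echo mb_strlen($arrow, 'UTF-8'), PHP_EOL;       // 1
echo mb_strlen($garbled, 'UTF-8'), PHP_EOL;     // 3
```

If the output of the real code shows the same one-character-becomes-three pattern, the input is most likely valid UTF-8 that is being decoded with the wrong charset, rather than text that was damaged before it reached the parser.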

TheBigK
  • It looks like the variables you use for replacements (e.g. `$title`) are in a different encoding than your HTML. Try using [mb_detect_encoding](https://www.php.net/manual/en/function.mb-detect-encoding.php) on your HTML, replacement values and end result to check whether they actually match – apokryfos Mar 22 '23 at 11:28
  • Why is there a `return` in your `foreach` loop? Why is this code not indented? – miken32 Mar 22 '23 at 14:46
  • And see here: https://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly – miken32 Mar 22 '23 at 14:48
  • Sorry about the indentation. I realised the error with the return statement and fixed it. It only prevented the second link from being processed. Looks like this does the job: `$domDocument->loadHTML(mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8'), LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR);` – TheBigK Mar 23 '23 at 12:14
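
A sketch of the same idea that avoids the `HTML-ENTITIES` pseudo-encoding, which is deprecated as of PHP 8.2: convert every non-ASCII character to a numeric entity with `mb_encode_numericentity` before handing the text to `loadHTML`. It assumes the input is UTF-8. `replaceAnchorText` and the `$titleFor` callback are hypothetical names standing in for the loop and `fetchPageTitle` from the question, and the title lookup is stubbed so that no HTTP request is made.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: replace the text of every <a> element with the title
 * returned by $titleFor for its href. Assumes $html is UTF-8.
 */
function replaceAnchorText(string $html, callable $titleFor): string
{
    // Turn every character above ASCII into a numeric entity (e.g. &#8594;),
    // so the parser never has to guess the charset of the input bytes.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $doc = new DOMDocument();
    $doc->loadHTML($ascii, LIBXML_HTML_NOIMPLIED | LIBXML_NOERROR);

    foreach ($doc->getElementsByTagName('a') as $link) {
        // textContent replaces the anchor's children with a plain text node.
        $link->textContent = $titleFor($link->getAttribute('href'));
    }

    // Serialize once, after all links have been processed.
    return $doc->saveHTML();
}

// Usage with a stubbed title lookup instead of a real HTTP request.
$html = '<p>Settings → Apple → Tag: <a href="https://www.example.com">www.example.com</a></p>';
$result = replaceAnchorText($html, fn (string $href): string => 'Example Domain');

echo html_entity_decode($result, ENT_QUOTES | ENT_HTML5, 'UTF-8');
```

`saveHTML()` may write non-ASCII characters back out as entities such as `&rarr;`. Browsers render these identically, and `html_entity_decode` restores the literal characters where that matters. The HTML fetched inside `fetchPageTitle` probably needs the same treatment, since a remote page that declares its charset only in an HTTP header gives `loadHTML` nothing to go on.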
