
I'm looking for a way in PHP (with regex, maybe?) to convert a string of HTML that includes links into a string of plain text that adds the URL of the link after the text.

Here's an example of what I'm thinking:

$html = '<p><a href="http://www.example.com/maybe/something/here/">Link name</a> 
        for something or another. <a href="https://www.examplesecure.com/">Another link
        </a> to something else.</p>';

// Regex to find the URLs
????

// Add the found URLs as strings after the closing a tags
????

// Convert to plain text
$text = trim(strip_tags($html));

Ideally, I'd end up with this string:

Link name [http://www.example.com/maybe/something/here/] for something or another.
Another link [https://www.examplesecure.com/] to something else.
mymiracl
isabisa
    Do not use Regular Expressions to parse HTML. Never. Ever. Try [DomCrawler](http://symfony.com/doc/current/components/dom_crawler.html), or [DomDocument](http://php.net/manual/en/class.domdocument.php) for a native solution. – BugHunterUK Feb 21 '16 at 19:32
  • I would use a DOM parser like Simple HTML DOM Parser to parse the HTML. If you are just looking for the URLs, take a look at https://stackoverflow.com/questions/11588542/get-all-urls-in-a-string-with-php/11588614#11588614 – redelschaap Feb 21 '16 at 19:35
  • So no preg_match, just to find the URLs? – isabisa Feb 21 '16 at 19:47

1 Answer


Use DOMDocument for this:

$htmlString = '<div><p><a href="/page">some text</a><non-standard-tag><a href="/page-2">more text</a></non-standard-tag>';

libxml_use_internal_errors(true); //suppress errors when importing invalid HTML
$dom = new DOMDocument();
$dom->loadHTML($htmlString);
$xpath = new DOMXPath($dom);

$links = [];
$linksAsString = '';

foreach ($xpath->query('//a') as $linkElement){
    /**
     * @var DOMElement $linkElement
     */
    $link = [
        'href' => $linkElement->getAttribute('href'),
        'text' => $linkElement->textContent
    ];
    $links[] = $link;
    $linksAsString .= $link['text'] . " [{$link['href']}] ";
}
libxml_clear_errors();

var_dump($links);
echo $linksAsString;
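To get exactly the output format asked for in the question, the same DOMDocument approach can also rewrite each link in place before stripping tags. A sketch using the question's own HTML (the whitespace collapsing at the end is just to flatten the multi-line string):

```php
<?php
// Sketch: replace each <a> element with a plain-text node "link text [href]",
// then strip the remaining tags.
$html = '<p><a href="http://www.example.com/maybe/something/here/">Link name</a>
        for something or another. <a href="https://www.examplesecure.com/">Another link
        </a> to something else.</p>';

libxml_use_internal_errors(true); // tolerate invalid HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//a') as $linkElement) {
    $replacement = $dom->createTextNode(
        trim($linkElement->textContent) . ' [' . $linkElement->getAttribute('href') . ']'
    );
    $linkElement->parentNode->replaceChild($replacement, $linkElement);
}
libxml_clear_errors();

// Strip tags and collapse the multi-line whitespace into single spaces.
$text = trim(preg_replace('/\s+/', ' ', strip_tags($dom->saveHTML())));
echo $text;
// → Link name [http://www.example.com/maybe/something/here/] for something or another. Another link [https://www.examplesecure.com/] to something else.
```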
Aleksey Ratnikov
  • Added code to implode links into string with square brackets. – Aleksey Ratnikov Feb 21 '16 at 23:18
  • I would love to get this to work because it looks like a great elegant solution. But I can't get my HTML to validate because it's generated by a WP script that I can't control and so it won't parse as XML. :( – isabisa Feb 23 '16 at 01:45
  • I don't understand - in the question you said that you want to extract links from HTML with regex, but now you say that you can't get the HTML. If you can't intercept the WP script, you can just grab its output - it is HTML anyway. – Aleksey Ratnikov Feb 23 '16 at 10:15
  • It's a string that contains HTML, but the HTML isn't validating. I just don't have control over the HTML that is output and can't make it validate, so it won't convert to XML. – isabisa Feb 24 '16 at 02:01
  • Got it. Corrected answer to use DOMDocument that is able to process invalid HTML (SimpleXMLElement can't). – Aleksey Ratnikov Feb 24 '16 at 09:08