
I'm looking for a way in PHP (with regex, maybe?) to convert a string of HTML that includes links into a string of plain text that adds the URL of the link after the text.

Here's an example of what I'm thinking:

$html = '<p><a href="http://www.example.com/maybe/something/here/">Link name</a> 
        for something or another. <a href="https://www.examplesecure.com/">Another link
        </a> to something else.</p>';

// Regex to find the URLs
????

// Add the found URLs as strings after the closing a tags
????

// Convert to plain text
$text = trim(strip_tags($html));

Ideally, I'd end up with this string:

Link name [http://www.example.com/maybe/something/here/] for something or another.
Another link [https://www.examplesecure.com/] to something else.
mymiracl
isabisa
    Do not use Regular Expressions to parse HTML. Never. Ever. Try [DomCrawler](http://symfony.com/doc/current/components/dom_crawler.html), or [DomDocument](http://php.net/manual/en/class.domdocument.php) for a native solution. – BugHunterUK Feb 21 '16 at 19:32
  • I would use a DOM parser like Simple HTML DOM Parser to parse the HTML. If you are just looking for the URLs, take a look at https://stackoverflow.com/questions/11588542/get-all-urls-in-a-string-with-php/11588614#11588614 – redelschaap Feb 21 '16 at 19:35
  • So no preg_match, just to find the URLs? – isabisa Feb 21 '16 at 19:47

1 Answer


Use DOMDocument for this:

$htmlString = '<div><p><a href="/page">some text</a><non-standard-tag><a href="/page-2">more text</a></non-standard-tag>';

libxml_use_internal_errors(true); //suppress errors when importing invalid HTML
$dom = new DOMDocument();
$dom->loadHTML($htmlString);
$xpath = new DOMXPath($dom);

$links = [];
$linksAsString = '';

foreach ($xpath->query('//a') as $linkElement){
    /**
     * @var DOMElement $linkElement
     */
    $link = [
        'href' => $linkElement->getAttribute('href'),
        'text' => $linkElement->textContent
    ];
    $links[] = $link;
    $linksAsString .= $link['text'] . " [{$link['href']}] ";
}
libxml_clear_errors();

var_dump($links);
echo $linksAsString;
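To get exactly the output format asked for in the question, the same DOMDocument approach can also rewrite each link in place before stripping tags. A sketch using the question's own HTML (the whitespace collapsing at the end is just to flatten the multi-line string):

```php
<?php
// Sketch: replace each <a> element with a plain-text node "link text [href]",
// then strip the remaining tags.
$html = '<p><a href="http://www.example.com/maybe/something/here/">Link name</a>
        for something or another. <a href="https://www.examplesecure.com/">Another link
        </a> to something else.</p>';

libxml_use_internal_errors(true); // tolerate invalid HTML
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

foreach ($xpath->query('//a') as $linkElement) {
    $replacement = $dom->createTextNode(
        trim($linkElement->textContent) . ' [' . $linkElement->getAttribute('href') . ']'
    );
    $linkElement->parentNode->replaceChild($replacement, $linkElement);
}
libxml_clear_errors();

// Strip tags and collapse the multi-line whitespace into single spaces.
$text = trim(preg_replace('/\s+/', ' ', strip_tags($dom->saveHTML())));
echo $text;
// → Link name [http://www.example.com/maybe/something/here/] for something or another. Another link [https://www.examplesecure.com/] to something else.
```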
Aleksey Ratnikov
  • Added code to implode links into string with square brackets. – Aleksey Ratnikov Feb 21 '16 at 23:18
  • I would love to get this to work because it looks like a great elegant solution. But I can't get my HTML to validate because it's generated by a WP script that I can't control and so it won't parse as XML. :( – isabisa Feb 23 '16 at 01:45
  • I don't understand - in the question you said that you want to extract links from HTML with regex, but now you say that you can't get the HTML. If you can't intercept the WP script, you can just grab its output - it is HTML anyway. – Aleksey Ratnikov Feb 23 '16 at 10:15
  • It's a string that contains HTML, but the HTML isn't validating. I just don't have control over the HTML that is output and can't make it validate, so it won't convert to XML. – isabisa Feb 24 '16 at 02:01
  • Got it. Corrected answer to use DOMDocument that is able to process invalid HTML (SimpleXMLElement can't). – Aleksey Ratnikov Feb 24 '16 at 09:08