I have a database with HTML content that contains text with links. Some of the links have a hash symbol in their URLs and others do not.

I need to delete the links whose URLs contain a hash symbol (keeping their text) and keep the links without one.

Example:

Input:

<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
    <li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
    <li><a href="http://example.com/books/1#cN" target="_blank">Chapter N</a></li>
</ul>
<br><br>
<a href="http://example.com/books/1">Harry Potter</a>
<ul>
    <li><a href="http://example.com/books/2#c1" target="_self">Chapter 1</a></li>
    <li><a href="http://example.com/books/2#c2" name="some have name" title="some others have title" >Chapter 2</a></li>
    <li><a href="http://example.com/books/2#c3">Chapter 3</a></li>
    <li><a href="http://example.com/books/2#cN"  >Chapter N</a></li>
</ul>

Desired Output:

<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>
<br><br>
<a href="http://example.com/books/2">Harry Potter</a>
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>

I am trying this code, but it deletes all the links, and I want to keep those with no hash symbol.

$content = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $content);

So, currently I am getting this:

The Lord of the Rings
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>
<br><br>
Harry Potter
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>

More details:

  • I am using PHP.
  • The only way I can tell which links to delete is the # symbol.
  • Some links contain a line break.

Example:

<a href="http://example.com">
    new line</a>
or
<a href="http://example.com">new
    line</a>
Just a nice guy
  • What have you tried? Where are you stuck? Post your code thus far. – fubar Jan 08 '18 at 01:03
  • @fubar Done. Thanks. – Just a nice guy Jan 08 '18 at 01:14
  • Parsing inconsistent HTML with regular expressions is very unreliable. Is this HTML that you've written, and can standardise, or are you crawling another site for it? You'd probably be better served using something like `DOMDocument`, if the HTML is even valid. – fubar Jan 08 '18 at 01:20
  • @fubar is spot on with the `DOMDocument` and if you use that in conjunction with my answer, you can parse through the HTML and delete the lines with a `#`. Regex is good for a single line, but regex by itself can't parse the document. – Capattax Jan 08 '18 at 01:37

4 Answers

You should avoid using regex for this. Use DOMDocument and DOMXPath instead.

<?php
$dom = new DOMDocument();

$dom->loadHtml('
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
    <li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
    <li><a href="http://example.com/books/1#cN" target="_blank">Chapter N</a></li>
</ul>
<br><br>
<a href="http://example.com/books/1">Harry Potter</a>
<ul>
    <li><a href="http://example.com/books/2#c1" target="_self">Chapter 1</a></li>
    <li><a href="http://example.com/books/2#c2" name="some have name" title="some others have title" >Chapter 2</a></li>
    <li><a href="http://example.com/books/2#c3">Chapter 3</a></li>
    <li><a href="http://example.com/books/2#cN"  >Chapter N</a></li>
</ul>
', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

foreach ($xpath->query("//a") as $link) {
    $href = $link->getAttribute('href');

    // href contains a #, so replace the parent's content with the link's text
    if (strpos($href, '#') !== false) {
        $link->parentNode->nodeValue = $link->nodeValue;
    }
}

echo $dom->saveHTML();

https://3v4l.org/8FQYb

Result:

<a href="http://example.com/books/1">The Lord of the Rings<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul><br><br><a href="http://example.com/books/1">Harry Potter</a><ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul></a>
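
Note that in the result above everything ends up nested inside the first `<a>`, which appears to be a side effect of using `LIBXML_HTML_NOIMPLIED` on a fragment with more than one top-level element. Assigning to `parentNode->nodeValue` would also discard any siblings the link has inside its parent. Below is a minimal sketch of one way to avoid both, assuming the fragment is wrapped in a temporary `<div>` and each matching link is replaced by its own child nodes. The function name `unwrapHashLinks` is hypothetical and not part of the answer, and the sketch assumes the input does not need special character-encoding handling.

```php
<?php
// Hypothetical helper (not from the original answer): removes every <a> whose
// href contains "#", keeping the link's content in place.
function unwrapHashLinks(string $html): string
{
    $dom = new DOMDocument();

    // Real-world fragments are often not strictly valid; collect parser
    // warnings instead of emitting them.
    libxml_use_internal_errors(true);
    // Wrap the fragment in one <div> so it has a single container element.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);

    // Select only the links whose href attribute contains a "#".
    foreach ($xpath->query('//a[contains(@href, "#")]') as $link) {
        // Move the link's children in front of it, then drop the empty link.
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }

    // Serialize only the children of the wrapper <div>, so neither the
    // wrapper nor the implied <html>/<body> appear in the output.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}
```

Because the links are unwrapped rather than overwritten through their parent, this should also cope with hash links that are not the only child of their parent, such as links outside of `<li>` tags or links whose text spans several lines.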
Lawrence Cherone

This regex matches the examples you've given. It detects URLs with a # somewhere in them. You can then write a replace statement that swaps each match for the text from capture group \1.

<a(?:\s+name=".*?")?\s+href=.*?#.*?>(.*?)<\/a>

Regex in action
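
As a rough illustration of the replace step in PHP, here is a sketch using `preg_replace`. The pattern is a stricter variant of the one above, not the original: it assumes double-quoted `href` values and never lets the match cross a `>`, so a link without a hash that sits on the same line is left alone. The function name `removeHashLinksRegex` is hypothetical. The caveats from the comments about parsing HTML with regex still apply.

```php
<?php
// Hypothetical helper: replaces each <a> whose double-quoted href contains
// "#" with the link's inner content (capture group 1).
function removeHashLinksRegex(string $html): string
{
    // i = case-insensitive, s = let "." match newlines inside the link text.
    $pattern = '~<a\b[^>]*\bhref\s*=\s*"[^"]*#[^"]*"[^>]*>(.*?)</a>~is';

    // preg_replace() returns null on failure; fall back to the input then.
    return preg_replace($pattern, '$1', $html) ?? $html;
}
```

For example, `removeHashLinksRegex($content)` could be used in place of the `preg_replace` call from the question.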

Chromane

After parsing through the HTML and selecting all the HTML links, you could use a foreach loop and str_replace on the condition that the string contains a pound/hash symbol.

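<?php
// Assumes $links already holds the link strings extracted from the HTML
// (for example, collected with DOMDocument)
foreach ($links as $line) {
    // strpos() can return 0 for a match at the start, so compare against false
    if (strpos($line, '#') !== false) {
        // str_replace() returns the result, so it has to be assigned back
        $links = str_replace($line, '', $links);
    }
}
?>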

This would replace each line containing a pound/hash symbol with an empty string, which the database would then treat as a blank line.

Capattax

Use the following pattern to match <a href=...> and </a> in the text, and replace the matched text with an empty string.

(?<=<li>)<a.+?>|</a>(?=</li>)

This removes the unwanted strings, instead of replacing the whole text with the wanted part.

Wray Zheng
  • This is a good idea, but not all the unwanted links are inside of `<li>` tags. But I know the ones I used in the example are... thanks anyway. – Just a nice guy Jan 08 '18 at 01:38