Remove links with hash symbol from string

Question

I have a database with HTML content, and some of the text contains links. Some links have a hash symbol in their URLs; others do not.

I need to remove the links that have a hash symbol, keeping those without one.

Example:

Input:

<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
    <li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
    <li><a href="http://example.com/books/1#cN" target="_blank">Chapter N</a></li>
</ul>
<br><br>
<a href="http://example.com/books/1">Harry Potter</a>
<ul>
    <li><a href="http://example.com/books/2#c1" target="_self">Chapter 1</a></li>
    <li><a href="http://example.com/books/2#c2" name="some have name" title="some others have title" >Chapter 2</a></li>
    <li><a href="http://example.com/books/2#c3">Chapter 3</a></li>
    <li><a href="http://example.com/books/2#cN"  >Chapter N</a></li>
</ul>

Desired Output:

<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>
<br><br>
<a href="http://example.com/books/1">Harry Potter</a>
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>

I am trying this code, but it deletes all the links, and I want to keep those with no hash symbol.

$content = preg_replace('#<a.*?>([^>]*)</a>#i', '$1', $content);

So, currently I am getting this:

The Lord of the Rings
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>
<br><br>
Harry Potter
<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul>

More details:

I am using PHP.
The only way I can tell which links to delete is the # symbol.
Some links contain a line break.

Example:

<a href="http://example.com">
    new line</a>
or
<a href="http://example.com">new
    line</a>

What have you tried? Where are you stuck? Post your code thus far. — fubar, Jan 08 '18 at 01:03
Parsing inconsistent HTML with regular expressions is very unreliable. Is this HTML that you've written and can standardise, or are you crawling another site for it? You'd probably be better served using something like `DOMDocument`, if the HTML is even valid. — fubar, Jan 08 '18 at 01:20
@fubar is spot on with the `DOMDocument` and if you use that in conjunction with my answer, you can parse through the HTML and delete the lines with a `#`. Regex is good for a single line, but regex by itself can't parse the document. — Capattax, Jan 08 '18 at 01:37

score 5 · Answer 1 · answered Jan 08 '18 at 01:50

You should avoid using regex here; use DOMDocument and DOMXPath instead.

<?php
$dom = new DOMDocument();

$dom->loadHtml('
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
    <li><a href="http://example.com/books/1#c3" name="name after href">Chapter 3</a></li>
    <li><a href="http://example.com/books/1#cN" target="_blank">Chapter N</a></li>
</ul>
<br><br>
<a href="http://example.com/books/1">Harry Potter</a>
<ul>
    <li><a href="http://example.com/books/2#c1" target="_self">Chapter 1</a></li>
    <li><a href="http://example.com/books/2#c2" name="some have name" title="some others have title" >Chapter 2</a></li>
    <li><a href="http://example.com/books/2#c3">Chapter 3</a></li>
    <li><a href="http://example.com/books/2#cN"  >Chapter N</a></li>
</ul>
', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$xpath = new DOMXPath($dom);

foreach ($xpath->query("//a") as $link) {
    $href = $link->getAttribute('href');

    // The href contains a #, so replace the link element with its text
    if (strpos($href, '#') !== false) {
        $link->parentNode->replaceChild($dom->createTextNode($link->nodeValue), $link);
    }
}

echo $dom->saveHTML();

https://3v4l.org/8FQYb

Result:

<a href="http://example.com/books/1">The Lord of the Rings<ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul><br><br><a href="http://example.com/books/1">Harry Potter</a><ul>
    <li>Chapter 1</li>
    <li>Chapter 2</li>
    <li>Chapter 3</li>
    <li>Chapter N</li>
</ul></a>
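
In the result above, everything ends up nested inside the first `<a>` element. This appears to be a side effect of `LIBXML_HTML_NOIMPLIED` when the fragment has several top-level elements: libxml treats the first one as the document root. The following is a minimal sketch of one way around it, not the answer's own code. It assumes the fragment is wrapped in a temporary `<div>` (an assumption of this sketch), and the helper name `removeHashLinks` is hypothetical. It also moves each link's child nodes up instead of copying the text, so nested markup such as `<b>` survives. Non-ASCII content would additionally need an encoding hint, which is left out here.

```php
<?php
// Hypothetical helper (not part of the answer): unwraps every <a> whose
// href contains '#', keeping the link's inner content in place.
function removeHashLinks(string $html): string
{
    $dom = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // Assumption: a temporary <div> gives the fragment a single root.
    // libxml adds <html><body> around it; only the <div> contents are returned.
    $dom->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    // XPath does the filtering: only links with '#' in the href are selected.
    foreach ($xpath->query('//a[contains(@href, "#")]') as $link) {
        // Move the link's children in front of it, then drop the empty link.
        while ($link->firstChild !== null) {
            $link->parentNode->insertBefore($link->firstChild, $link);
        }
        $link->parentNode->removeChild($link);
    }

    // Serialise the children of the wrapper, not the wrapper itself.
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

$html = <<<'HTML'
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a href="http://example.com/books/1#c1" target="_blank">Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter 2</a></li>
</ul>
HTML;

echo removeHashLinks($html);
```

Whitespace between tags may differ slightly from the input, because libxml decides which blank text nodes to keep and how to serialise them.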

In my code, it did not fix all the links, but it did fix all those not handled by Chromane's original answer. I really appreciate it. — Just a nice guy, Jan 08 '18 at 02:07
What's it not solving? (Update your question with a complete HTML example.) I'm happy to update. — Lawrence Cherone, Jan 08 '18 at 02:08
Chromane's update solved it. Thanks anyway, I appreciate your help and upvoted it. :-) — Just a nice guy, Jan 08 '18 at 02:12

score 2 · Accepted Answer · answered Jan 08 '18 at 01:22

This regex matches the examples you've given. It detects links whose URL contains a # somewhere. You can then write a replace statement that swaps each match for the text in capture group \1.

<a(?:\s+name=".*?")?\s+href=.*?#.*?>(.*?)<\/a>

Regex in action

Chromane

This almost works. Some links contain new lines, and it does not work on those. I am going to add an example at the bottom so you can see what I mean. – Just a nice guy Jan 08 '18 at 01:28
Added the example of links with new lines. Your regex works well except for this kind of link. – Just a nice guy Jan 08 '18 at 01:35
Adapted Regex Statement; ([\s\S]+?)<\/a> – Chromane Jan 08 '18 at 02:01
Just perfect. Thanks. – Just a nice guy Jan 08 '18 at 02:10
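
For reference, here is a minimal sketch of how the pattern from this answer might be used with `preg_replace`. It assumes the adapted capture group `([\s\S]+?)` from the comment thread above is meant to replace `(.*?)` in the answer's pattern. The `~` delimiters, the `i` flag and the sample `$content` are choices of this sketch, not something stated in the answer.

```php
<?php
// The answer's pattern, with the capture group swapped for [\s\S]+? so that
// link text spanning several lines is matched too (assumed combination).
$pattern = '~<a(?:\s+name=".*?")?\s+href=.*?#.*?>([\s\S]+?)</a>~i';

$content = <<<'HTML'
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a   href="http://example.com/books/1#c1" >Chapter 1</a></li>
    <li><a name="name before href" href="http://example.com/books/1#c2">Chapter
        2</a></li>
</ul>
HTML;

// Replace each matched link with the text from capture group 1.
$content = preg_replace($pattern, '$1', $content);
echo $content;
```

Because `.` does not match a line break by default, `href=.*?#` only succeeds when the `#` is on the same line as the `href`. That is what keeps the first link intact here, but it also means a stray `#` later on the same line as a hash-free link could cause a false match.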

Capattax · Answer 3 · 2018-01-08T01:37:44.770

After parsing through the HTML and selecting all the HTML links, you could use a foreach loop and str_replace on the condition that the string contains a pound/hash symbol.

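<?php
// $links is assumed to hold the link strings already extracted from the
// HTML (for example with DOMDocument)
foreach ($links as $i => $line) {
    // strpos() returns 0 for a match at position 0, so compare against false
    if (strpos($line, '#') !== false) {
        // Blank out this entry; str_replace() alone would discard its result
        $links[$i] = '';
    }
}
?>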

This would replace each line containing a pound/hash symbol with an empty string, and the database would treat it as such.

Wray Zheng · Answer 4 · 2018-01-08T01:34:06.770

Use the following pattern to match <a href=...> and </a> in the text, and replace the matched text with an empty string.

(?<=<li>)<a.+?>|</a>(?=</li>)

This removes the unwanted tags, instead of replacing the whole match with the wanted text.
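
A minimal sketch of how this pattern might be applied in PHP follows. The `~` delimiters and the sample `$content` are assumptions of the sketch. Note that the pattern never checks for `#`: it removes any link that sits directly between `<li>` and `</li>`.

```php
<?php
// Pattern from this answer, wrapped in ~ delimiters so the slashes need no escaping.
$pattern = '~(?<=<li>)<a.+?>|</a>(?=</li>)~';

$content = <<<'HTML'
<a href="http://example.com/books/1">The Lord of the Rings</a>
<ul>
    <li><a href="http://example.com/books/1#c1" target="_blank">Chapter 1</a></li>
    <li><a href="http://example.com/books/1#c2">Chapter 2</a></li>
</ul>
HTML;

// Delete the matched opening and closing tags, leaving the link text in place.
$content = preg_replace($pattern, '', $content);
echo $content;
```

The top-level link survives only because it is not wrapped in an `<li>`.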

edited Jan 08 '18 at 01:34

answered Jan 08 '18 at 01:24

This is a good idea, but not all the unwanted links are inside `<li>` tags. I know the ones I used in the example are, though. Thanks anyway. – Just a nice guy Jan 08 '18 at 01:38
