Remove html entities and extract text content using regex

Question

I have a text containing just HTML entities such as &lt; and &nbsp;. I need to remove them all and get just the text content:

&nbspHello there&lt;testdata&gt;

So, I need to get Hello there and testdata from this section. Is there any way of using negative lookahead to do this?

I tried the following: /((?!&.+;).)+/ig, but this doesn't seem to work very well. So, how can I extract just the required text from there?

score 23 · Answer 1 · edited Mar 18 '21 at 22:30

A better syntax to find HTML entities is the following regular expression:

/&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-fA-F]{1,6});/ig

This pattern ignores false entities, that is, an ampersand that is not followed by a valid name or number and a closing semicolon.
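As a rough illustration of how this pattern might be applied to the question, here is a minimal Python sketch. It uses the pattern above with a non-capturing group so that `re.split` returns only the text pieces. The helper name `text_between_entities` and the sample strings are assumptions made for illustration and are not part of the original answer.

```python
import re

# The answer's pattern, with a non-capturing group (?:...) so that
# re.split() does not put the entity names into the result list.
ENTITY_RE = re.compile(
    r"&(?:[a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});",
    re.IGNORECASE,
)


def text_between_entities(text: str) -> list[str]:
    """Split on entities and keep the non-blank text pieces (hypothetical helper)."""
    return [piece for piece in ENTITY_RE.split(text) if piece.strip()]


print(text_between_entities("&nbsp;Hello there&lt;testdata&gt;"))
# A bare ampersand followed by a space is not treated as an entity.
print(text_between_entities("Fish & chips; more"))
```

Entities written without a trailing semicolon are left in place by this pattern, since it requires the `;`.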

edited Mar 18 '21 at 22:30

Kevin Doyon

answered Jun 07 '19 at 08:39

Mahoor13

This doesn't necessarily matter, but it's worth noting that this is technically not comprehensive. `&amp`, `{`, and `{` are all valid HTML entities that won't be matched by this. – Grant Gryczan Sep 10 '21 at 03:05
[a-z0-9]+ matches &amp and similar forms, and #[0-9]{1,6} matches all entities from `&#0;` to `&#999999;`. I think other forms are not useful. – Mahoor13 Sep 11 '21 at 12:59
It matches `&amp;`, not `&amp`. Your regex requires a semicolon, but `&amp` is a valid HTML entity. And I didn't say anything about whether those forms of entities are useful. I only said this regex is not comprehensive. If someone needed a comprehensive regex for their use case, this would not work. – Grant Gryczan Sep 11 '21 at 17:30

score 4 · Accepted Answer · edited May 23 '17 at 11:57

Here are 2 suggestions:

1) Match all the entities using /(&.+;)/ig. Then, in whatever programming language you are using, replace those matches with an empty string. For example, in PHP use preg_replace; in C# use Regex.Replace. See this SO question for a similar solution that accounts for more cases: How to remove html special chars?

2) If you really want to do this by matching the plaintext portions, you could try something like this: /(?:^|;)([^&;]+)(?:&|$)/ig. What it's actually trying to do is match the pieces between ; and &, with special cases for text at the start and end that has no entity next to it. This is probably not the way to go; you're likely to run into cases where it breaks.
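To make the two suggestions concrete, here is a small Python sketch (Python is an assumption here; the answer itself names PHP and C#). It also illustrates one caveat: `.+` is greedy, so on a string with several entities `&.+;` can match everything from the first `&` to the last `;`. The narrower pattern `&[^&;\s]+;` is an assumed variation, not the answer's own, and the sample string is a well-formed version of the question's input.

```python
import re

sample = "&nbsp;Hello there&lt;testdata&gt;"  # assumed well-formed input

# Suggestion 1 as written: greedy, so it matches the whole sample here.
greedy = re.compile(r"&.+;", re.IGNORECASE)
greedy_result = greedy.sub("", sample)

# Assumed variation: stop at the first ';' and disallow '&' or whitespace inside.
narrow = re.compile(r"&[^&;\s]+;")
narrow_result = " ".join(narrow.sub(" ", sample).split())

# Suggestion 2 as written: capture the text between ';' and '&'.
between = re.compile(r"(?:^|;)([^&;]+)(?:&|$)")
between_result = between.findall(sample)

print(repr(greedy_result))
print(narrow_result)
print(between_result)
```

Replacing each entity with a space before collapsing whitespace keeps neighbouring words from running together.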

Thanks, tried 2-Just got back from the looney bin. I'll go with 1. — Mkl Rjv, Oct 08 '14 at 10:59

score 1 · Answer 3 · answered Oct 14 '20 at 16:31

It's language-specific, but in Python you can use html.unescape (see the documentation for the html module). For example:

import html
print(html.unescape("This string contains &amp; and &gt;"))
# prints: This string contains & and >
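Note that `html.unescape` decodes entities into the characters they stand for rather than deleting them, so it fits best when the goal is readable text. A brief sketch, with made-up sample strings, of the forms it handles, including the semicolon-less `&amp` raised in the comments on another answer:

```python
import html

# Named, decimal, and hexadecimal references all decode to characters.
decoded = html.unescape("&lt;testdata&gt; &#123; &#x7B;")

# Some legacy named references are also recognised without a trailing ';'.
legacy = html.unescape("&amp")

print(decoded)
print(legacy)
```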

answered Oct 14 '20 at 16:31

gneusch

score 0 · Answer 4 · answered Jul 13 '23 at 13:34

After a short look at the Python documentation, one comes across the html.parser module: https://docs.python.org/3/library/html.parser.html#module-html.parser

And after some short prototyping, one can come up with this fairly simple code:

from html.parser import HTMLParser

line_with_html = 'Data before tag with <span style="color:var(--md-font-color-green)">some green text</span> with a nice logo'


class CleanHTML(HTMLParser):
    def reset(self) -> None:
        self.extracted_data = ""
        return super().reset()

    def remove_tags(self, html_data: str) -> str:
        """
        Args:
            html_data (str): HTML data which might contain tags.

        Returns:
            str: Data without any HTML tags. Forces feeding of any buffered data.
        """
        self.reset()
        self.feed(html_data)
        self.close()
        return self.extracted_data

    def handle_data(self, data: str) -> None:
        """
        Args:
            data (str): Html data extracted from tags to be processed.
        """
        self.extracted_data += data


p = CleanHTML()
print(p.remove_tags(line_with_html))

No need to:

Use regular expressions
Use third-party modules like BeautifulSoup
Use parsers which were not intended for HTML, like the XML parser
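The same parser-based idea can be pointed at the entity-only input from the question. By default `HTMLParser` is created with `convert_charrefs=True`, which decodes entities into the text (so `&lt;` becomes `<`). The sketch below, with the hypothetical names `EntityStripper` and `strip_entities`, turns that off so the entities are dropped instead. It is a minimal illustration under those assumptions, not a drop-in replacement for the class above.

```python
from html.parser import HTMLParser


class EntityStripper(HTMLParser):
    """Collect text and silently drop tags and character references."""

    def __init__(self) -> None:
        # With convert_charrefs=False, entities are reported through
        # handle_entityref()/handle_charref() instead of being decoded
        # into the text. Their default implementations do nothing, so
        # leaving them un-overridden discards the entities.
        super().__init__(convert_charrefs=False)
        self.pieces: list[str] = []

    def handle_data(self, data: str) -> None:
        # Called for each run of plain text between tags and entities.
        self.pieces.append(data)


def strip_entities(text: str) -> list[str]:
    """Return the plain-text pieces of `text` (hypothetical helper)."""
    parser = EntityStripper()
    parser.feed(text)
    parser.close()
    return parser.pieces


print(strip_entities("&nbsp;Hello there&lt;testdata&gt;"))
```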
