How to make a Regex Pattern for HTML Simple Text?

Question

I am trying to learn Regex patterns for a class. I am making a simple HTML Lexer/Parser. I know this is not the best or most efficient way to make a Lexer/Parser but it is only to understand Regex patterns.

So my question is: how do I create a pattern that checks that a String contains no HTML tags (e.g. <TAG>) and no HTML entities (e.g. &ENT;)?

This is what I could come up with so far but it still does not work:

.+?(^(?:&[A-Za-z0-9#]+;)^(?:<.*?>))

EDIT: The only problem is that I can't negate the final outcome. I need to find a complete pattern that accomplishes this task, if that is possible, even though it might not be pretty. I never mentioned it, but the pattern is supposed to match any simple text in an HTML page.

possible duplicate of [RegEx match open tags except XHTML self-contained tags](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — Ben Jackson, Dec 10 '10 at 20:28
Why can't you negate the pattern? I don't get your reasoning... — Platinum Azure, Dec 11 '10 at 15:36
You could copy your HTML string and then use the regex patterns below to get rid of the HTML tags and entities (substitute the patterns with nothing). That leaves you with plain text (though the entities are gone instead of translated to their actual characters). — Platinum Azure, Dec 11 '10 at 15:38
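
A minimal sketch of the substitution idea from the comment above, assuming the pattern `<[^>]+>|&[^;]+;` from the accepted answer. The class name `StripMarkup` and the method `strip` are made up for illustration. As the comment says, entities are removed, not translated.

```java
public class StripMarkup {
    // Hypothetical helper: delete every tag-like or entity-like run.
    public static String strip(String html) {
        return html.replaceAll("<[^>]+>|&[^;]+;", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("hello <b>world</b>&nbsp;!")); // hello world!
        System.out.println(strip("Hello&nbsp;world"));          // Helloworld
    }
}
```

Note the second result: the non-breaking space disappears entirely, so the two words run together.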

score 2 · Answer 1 · answered Dec 10 '10 at 20:34

You could use the expression <.+?>|&.+?; to search for a match, and then negate the result.

<.+?> says: first a <, then any character one or more times (as few as possible, because +? is lazy), then a >
&.+?; says: first a &, then any character one or more times (as few as possible), then a ;
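
To see why the `?` after `+` matters, here is a small sketch comparing the lazy form with the greedy one. The class `LazyVsGreedy` and the helper `firstMatch` are hypothetical names used only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LazyVsGreedy {
    // Hypothetical helper: the first match of regex in input, or null if none.
    public static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "hello <b>world</b>!";
        System.out.println(firstMatch("<.+?>", s)); // lazy:   <b>
        System.out.println(firstMatch("<.+>", s));  // greedy: <b>world</b>
    }
}
```

For a pure yes/no check either form finds a match, but the lazy form reports each tag separately.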

Here is a complete example (also posted as an ideone.com demo):

import java.util.regex.*;

public class Test {
    public static void main(String[] args) {
        String[] tests = { "hello", "hello <b>world</b>!", "Hello&nbsp;world" };
        Pattern p = Pattern.compile("<.+?>|&.+?;");
        for (String test : tests) {
            Matcher m = p.matcher(test);
            if (m.find())
                System.out.printf("\"%s\" has HTML: %s%n", test, m.group());
            else
                System.out.printf("\"%s\" has no HTML%n", test);
        }
    }
}

Output:

"hello" has no HTML
"hello <b>world</b>!" has HTML: <b>
"Hello&nbsp;world" has HTML: &nbsp;

Platinum Azure · Accepted Answer · 2010-12-10T20:40:03.107

1

If you're looking to match strings that do NOT follow a pattern, the simplest thing to do is to match the pattern and then negate the result of the test.

<[^>]+>|&[^;]+;

Any string that matches this pattern will have AT LEAST ONE tag (as you've defined it) or entity (as you've defined it). So the strings you want are strings that DO NOT match this pattern (they will have NO tags or entities).
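
A minimal sketch of "match, then negate" in Java, assuming the pattern above. The class name `PlainTextCheck` and the method `isPlainText` are hypothetical names chosen for this example.

```java
import java.util.regex.Pattern;

public class PlainTextCheck {
    // Pattern from this answer: a tag-like run or an entity-like run.
    private static final Pattern MARKUP = Pattern.compile("<[^>]+>|&[^;]+;");

    // Hypothetical helper: true when no tag or entity is found anywhere.
    // The negation happens in Java code, not inside the regex.
    public static boolean isPlainText(String s) {
        return !MARKUP.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] tests = { "hello", "hello <b>world</b>!", "Hello&nbsp;world" };
        for (String t : tests) {
            System.out.println("\"" + t + "\" -> " + isPlainText(t));
        }
    }
}
```

Because the pattern only follows the loose definitions from the question, ordinary text such as "Tom & Jerry; friends" is also reported as containing an entity.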

edited Dec 10 '10 at 20:40

answered Dec 10 '10 at 20:29

Platinum Azure


I would change both `*` to a `+` and remove the capturing group. – aioobe Dec 10 '10 at 20:30
Would this be possible? ^(?:<[^>]+>|&[^;]+;) – Free Lancer Dec 10 '10 at 23:18
ie: Grouping the pattern and then negating the whole thing within the pattern. – Free Lancer Dec 10 '10 at 23:19
No, because you can't negate a pattern, only a character class. The `^` character, outside a character class, works differently: it anchors the pattern to the beginning of the string. (That's a fancy way of saying the string needs to start with the pattern, not merely contain it) – Platinum Azure Dec 11 '10 at 15:34
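
The anchoring behaviour described in the last comment can be checked with a short sketch. The class and method names are hypothetical. The second pattern is an assumption that does not come from this thread: Java's regex flavor also supports negative lookahead, `(?!...)`, which is one way to express "no tag or entity anywhere" as a single pattern.

```java
import java.util.regex.Pattern;

public class CaretDemo {
    // Outside a character class, ^ anchors the match to the start of the input.
    private static final Pattern ANCHORED =
            Pattern.compile("^(?:<[^>]+>|&[^;]+;)");

    // Assumption, not from the thread: negative lookahead rejects any input
    // that contains a tag or entity. (?s) lets . cross line breaks,
    // as [^>] and [^;] already do.
    private static final Pattern NO_MARKUP =
            Pattern.compile("(?s)^(?!.*(?:<[^>]+>|&[^;]+;)).*$");

    public static boolean startsWithMarkup(String s) {
        return ANCHORED.matcher(s).find();
    }

    public static boolean hasNoMarkup(String s) {
        return NO_MARKUP.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(startsWithMarkup("<b>hi</b>"));       // true
        System.out.println(startsWithMarkup("hi <b>there</b>")); // false: tag is not at the start
        System.out.println(hasNoMarkup("hello"));                // true
        System.out.println(hasNoMarkup("hi <b>there</b>"));      // false
    }
}
```

The second call shows that the anchored pattern is not a negation: it is false for a string that does contain a tag.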
