Check if a string is html or not

Question

I have a string and I want to check whether it is HTML or not. I am using a regex for this but not getting the proper result.

I validated my regex and it works fine here.

var htmlRegex = new RegExp("<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>");
return htmlRegex.test(testString);

Here's the fiddle but the regex isn't running in there. http://jsfiddle.net/wFWtc/

On my machine, the code runs fine but I get false instead of true as the result. What am I missing here?

Use an HTML parser to parse HTML. Please read [this](http://stackoverflow.com/a/1732454/464709) if you haven't already. — Frédéric Hamidi, Mar 17 '13 at 08:25
The question keeps coming up; there should be a Stack bot that automatically posts a comment on every question with html and regex in it – Bartlomiej Lewandowski, Mar 17 '13 at 08:26
It kinda depends on what level of sophistication you want from the check. You could check if the string contains at least one `<` and at least one `>` and call it HTML, or you could check that it is strictly valid with correct HTML syntax, or anything from between. For the simplest of cases a HTML parser is not necessary. — JJJ, Mar 17 '13 at 08:30
@Juhana : ^ I second Juhana's comment. I really don't have a need here to validate html by its tag names. — user1240679, Mar 17 '13 at 08:32
But yes, the level of sophistication depends on *why* the check is done in the first place, as @nhahtdh says. — JJJ, Mar 17 '13 at 08:33
In which case, I can reframe my question to say that it should only check if the string is some kind of valid markup format (not necessarily html) — user1240679, Mar 17 '13 at 08:34
@user1240679: Valid markup format? What kind of validity? In the strictest sense, you need DTD to describe it. In a loose sense, you may want to check that the tags are matched up properly. Either of the 2 cases above are not job for regex. — nhahtdh, Mar 17 '13 at 08:36
Use the jquery [$.parseHTML](https://api.jquery.com/jquery.parsehtml/) function. See my answer [here](https://stackoverflow.com/a/52319805/7988857). — stomtech, Sep 13 '18 at 18:42

score 414 · Accepted Answer · edited Oct 04 '19 at 11:33

A better regex to use to check if a string is HTML is:

/^/

For example:

/^/.test('') // true
/^/.test('foo bar baz') //true
/^/.test('<p>fizz buzz</p>') //true

In fact, it's so good, that it'll return true for every string passed to it, which is because every string is HTML. Seriously, even if it's poorly formatted or invalid, it's still HTML.

If what you're looking for is the presence of HTML elements, rather than simply any text content, you could use something along the lines of:

/<\/?[a-z][\s\S]*>/i.test()

It won't help you parse the HTML in any way, but it will certainly flag the string as containing HTML elements.
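
As a rough illustration (a sketch, not part of the original answer), the element-detecting pattern can be wrapped in a small helper and tried on a few inputs. The name `containsTagLike` is made up for this example; the last two inputs contrast a spaced and an unspaced comparison.

```javascript
// Hypothetical helper name; the pattern is the one given in the answer above.
const containsTagLike = (str) => /<\/?[a-z][\s\S]*>/i.test(str);

const samples = [
  "foo bar baz",      // no "<" followed by a letter
  "<p>fizz buzz</p>", // ordinary element
  "foo </b> bar",     // stray closing tag
  "a < b && a > c",   // "<" is followed by a space, so it is not a tag opener
  "a<b && a>c",       // "<b" looks like a tag opener, so this is flagged too
];

for (const s of samples) {
  console.log(JSON.stringify(s), "->", containsTagLike(s));
}
```

The last sample shows the limit of the approach: the pattern flags anything that looks like a tag opener followed later by a `>`, whether or not markup was intended.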

edited Oct 04 '19 at 11:33

DerpyNerd

answered Mar 17 '13 at 08:43

zzzzBov

Well, this answer demonstrates how the **purpose** of HTML detection comes into play. The prime example is jQuery, where it has to distinguish between HTML and a CSS selector when a string is passed in. – nhahtdh Mar 17 '13 at 08:50
109

I'm honestly surprised I didn't get more downvotes for the snark. – zzzzBov Mar 17 '13 at 17:56
Hum i would say that this `/<[\s\S]*>/i.test()` is sufficient – clenemt Feb 04 '15 at 13:31
10

@clenemt, so you consider `a < b && a > c` to be HTML? – zzzzBov Feb 04 '15 at 14:20
1

@zzzzBov you know that you consider `a<b && a>c` to be HTML... I wish HTML detection could be simplified that much. Parsing is never easy. – oriadam Mar 15 '16 at 17:00
2

@oriadam, the context was for detecting elements in that case. If you use `a < b && a > c` the browser will turn the `>` and `<` characters into `&gt;` and `&lt;` entities appropriately. If, instead, you use `a<b && a>c` the browser will interpret the markup as `a<b && a>c</b>` because the lack of a space means that `<b && a>` opens a `<b>` element. [Here's a quick demo of what I'm talking about](https://jsfiddle.net/utf9027v/). – zzzzBov Mar 15 '16 at 17:24
Personally, this version fits my needs better `^<([a-z]+)[^>]*>[\S\s]*<\/\1>$` – Jonathan Parent Lévesque Jun 09 '17 at 12:45
@JonathanParentLévesque, you will run into issues with custom elements, which must include `-` characters in their tag names. – zzzzBov Jun 09 '17 at 13:10
@zzzzBov Personally I don't use custom tags (which are not officially considered "valid HTML5" by the W3C Consortium), but it can be supported with this variation https://regex101.com/r/gOQIFi/1. I used this answer to validate the rules: https://stackoverflow.com/questions/30258630/what-are-valid-html5-custom-tags – Jonathan Parent Lévesque Jun 09 '17 at 13:58
This one should detect the end tags as well: `/<\/?[a-z][\s\S]*>/i` – Waheed Nov 05 '17 at 08:04
1

@zzzzBov well, you should've gotten more downvotes cause this is not true and you know it: "every string is HTML". It's a markup language so it's not html if it doesn't contain any markup instructions. But you're funny :) – Marvin Saldinger Mar 26 '18 at 14:13
@MarvinSaldinger "it's not html if it doesn't contain any markup instructions" If I were to write an `example.html` file that contained only the text `lorem ipsum`, it could certainly be considered HTML. The files contents may also be considered to be invalid, but so too are many other useful webpages, so that distinction can't be used to delineate "HTML" from "not HTML". I respect your disagreement with my answer, but I'm going to disagree with your assertion that I'm purposefully propagating false information. If you think you have a better answer, please by all means write it. – zzzzBov Mar 26 '18 at 14:35
21

This is probably the highest voted troll answer I've seen on so. ;) – aandis Jan 20 '19 at 17:17
Thanks! I added `\/` to include self closing tags like ``: `<\/?[a-z][\s\S]+>` – DerpyNerd Oct 04 '19 at 11:26
4

Got a downvote from me, alas, and I bathe in snark. There are two different questions, seemingly both being answered, when they are quite distinct. "Does this text contain HTML markup of any kind" → "does it contain `<[a-zA-Z]` followed eventually by a `>`", and there you go. Second: is the string an HTML document? Does it start with `<!DOCTYPE html>`? [Because:](https://tomhodgins.hashnode.dev/code-that-you-just-never-ever-need-to-write-cjpblnfff00km0ys149kbttbg) `<!DOCTYPE html><title>Yes, really.</title><p>This is everything you need for a fully valid, complete HTML document.` – amcgregor Feb 05 '20 at 13:55
@amcgregor, checking for the presence of `<!DOCTYPE html>` is significantly problematic, as it will fail for older doctypes, as well as older webpages before a doctype was standardized (which was mostly used to avoid quirksmode). So by your own logic, you wouldn't consider [the first webpage](http://info.cern.ch/hypertext/WWW/TheProject.html) to contain HTML. Your example includes the statement that a doctype is necessary for a valid and complete HTML document, but those requirements are not part of the question as stated. – zzzzBov Feb 05 '20 at 15:38
@zzzzBov Interesting example link you provide there. Correct, I do not consider that HTML. It's some form of mangled almost-HTML, certainly. However, use of `
` and `` was unnecessary as neither have attributes assigned to them. ;^D However! It absolutely _would_ match my "HTML fragment" / "likely HTML" heuristic regular expression. It is **not** a complete HTML document. Even if it _can_ be interpreted that way. [I highlight in my own answer](https://stackoverflow.com/a/60077957/211827) that there are two very distinct questions being asked.
– amcgregor Feb 05 '20 at 19:26
@zzzzBov Ah, reviewing my response history this morning, additional point on that "first webpage": HTML didn't actually _exist_ formally at that time. That "document" as a "document" is an SGML fragment, **not HTML**, as Tim Berners-Lee was adapting SGML to the context of hypertext. [This bit of web archaeology](http://infomesh.net/html/history/early/) describes an even earlier page from 1990, yours circa Dec 1992, and also points at earlier work pre-1990 via GML. – amcgregor Feb 27 '20 at 16:39
1

This has serious [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/4028303) vibes – Jacob Stamm Jan 28 '22 at 18:55

dfsq · Answer 2 · 2017-12-15T11:00:25.940

106

Method #1. Here is the simple function to test if the string contains HTML data:

function isHTML(str) {
  var a = document.createElement('div');
  a.innerHTML = str;

  for (var c = a.childNodes, i = c.length; i--; ) {
    if (c[i].nodeType == 1) return true; 
  }

  return false;
}

The idea is to let the browser's DOM parser decide whether the provided string looks like HTML. As you can see, it simply checks for an ELEMENT_NODE (nodeType of 1).

I ran a couple of tests and it looks like it works:

isHTML('<a>this is a string</a>') // true
isHTML('this is a string')        // false
isHTML('this is a <b>string</b>') // true

This solution will properly detect an HTML string; however, it has the side effect that img/video/etc. tags will start downloading their resources once parsed via innerHTML.

Method #2. Another method uses DOMParser and doesn't have loading resources side effects:

function isHTML(str) {
  var doc = new DOMParser().parseFromString(str, "text/html");
  return Array.from(doc.body.childNodes).some(node => node.nodeType === 1);
}

Notes:
1. Array.from is an ES2015 method; it can be replaced with [].slice.call(doc.body.childNodes).
2. The arrow function in the some call can be replaced with a regular anonymous function.
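
Elements such as `<meta>`, `<link>`, or a leading `<script>` are placed in the document's `<head>` by the HTML parser, so counting only the children of `doc.body` can miss them. One possible variation of Method #2 is sketched below, assuming a standard `DOMParser` is available (it is not in plain Node.js). The name `containsElement` and the injectable `Parser` parameter are inventions of this sketch, not part of the answer.

```javascript
// Sketch: count element children of both <head> and <body>.
// `Parser` defaults to the browser's DOMParser; accepting it as a parameter
// is only a convenience for environments that supply their own implementation.
function containsElement(str, Parser = globalThis.DOMParser) {
  if (typeof Parser !== "function") {
    throw new Error("No DOMParser available in this environment");
  }
  const doc = new Parser().parseFromString(str, "text/html");
  // childElementCount ignores text and comment nodes.
  return doc.head.childElementCount + doc.body.childElementCount > 0;
}
```

In a browser, `containsElement('<meta charset="utf-8">')` should report true where the body-only check reports false, while plain text still reports false. A string consisting only of `<html>`, `<head>`, or `<body>` tags would still go undetected, since those elements are not children of head or body.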

edited Dec 15 '17 at 11:00

answered Mar 17 '13 at 08:40

dfsq

4

This's an awesome idea. However, this function could not detect closing tag (i.e. `isHTML("") --> false`). – Lewis Apr 02 '14 at 18:11
12

Great solution!.. The only negative side effect is that if your html contains any static resources like an image src attribute.. `innerHTML` will force the browser to start fetching those resources. :( – Jose Browne Jul 02 '14 at 09:46
1

@JoseBrowne even if it's not appended to the DOM? – kuus Jan 29 '17 at 20:15
2

@kuus Yes, even if not appending. Use DOMParser solution. – dfsq Jan 29 '17 at 20:21
this is much better than the accepted answers that's just being pedantic and doesn't answer the *real* question – supersan May 23 '19 at 19:18
1

Good idea, but wouldn't the accepted answer be better for performance? Especially if you have huge strings (pun intended) or if you have to use this test a lot. – DerpyNerd Oct 04 '19 at 11:30
@DerpyNerd Yes, DOM Parser solution would be significantly slower in such cases. – dfsq Oct 04 '19 at 20:03
Notice that with DOMParser, some elements (link, script, meta, ...), when placed first, will appear inside the `<head>` of the document, so they won't be detected by your function: ```isHTML('') === false``` – nachoab Sep 10 '20 at 10:34
1

Not to mention XSS doesn't seem to execute when using this method as its not added to the DOM! – David Kroukamp Oct 06 '21 at 10:36
1

Please do not use solution 1 and **do not put it in a library**. This is how [XSS](https://owasp.org/www-community/attacks/xss/) happens. – Arye Eidelman Mar 06 '22 at 20:50

Johan Dettmar · Answer 3 · 2023-01-19T09:59:33.730

23

Here's a sloppy one-liner that I use from time to time:

var isHTML = RegExp.prototype.test.bind(/(<([^>]+)>)/i);

It will basically return true for strings containing a < followed by SOMETHING followed by >.

By SOMETHING, I mean basically anything except an empty string.

It's not great, but it's a one-liner.

Usage

isHTML('Testing');               // false
isHTML('<p>Testing</p>');        // true
isHTML('<img src="hello.jpg">'); // true
isHTML('My < weird > string');   // true (caution!!!)
isHTML('<>');                    // false
isHTML('< >');                   // true (caution!!!)
isHTML('2 < 5 && 5 > 3');        // true (caution!!!)

As you can see it's far from perfect, but might do the job for you in some cases.
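
For readers wondering how the one-liner works, here is a small sketch of an equivalent spelled-out form. `RegExp.prototype.test.bind(regex)` returns a copy of `test` whose `this` is fixed to that regex, so calling it with a string is the same as calling `regex.test(string)`. The names `tagLike` and `isHTMLExplicit` are made up for this illustration.

```javascript
const tagLike = /(<([^>]+)>)/i;

// The one-liner: `test` with its `this` permanently bound to the regex.
const isHTML = RegExp.prototype.test.bind(tagLike);

// The same check written as an ordinary function (hypothetical name).
const isHTMLExplicit = (str) => tagLike.test(str);

for (const s of ["Testing", "<p>Testing</p>", "<>", "< >"]) {
  console.log(JSON.stringify(s), isHTML(s), isHTMLExplicit(s));
}
```

Because the regex has no `g` flag, `test` keeps no state between calls, so the bound function can be reused safely.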

edited Jan 19 '23 at 09:59

answered Apr 21 '16 at 14:50

Johan Dettmar

1

that's a lovely one-liner... never considered `.test.bind` – John Doherty Dec 29 '22 at 12:32
Can anyone explain this line of code? – mkvakin Mar 13 '23 at 05:15

CSᵠ · Answer 4 · 2013-03-17T09:57:08.577

17

A little bit of validation with:

/<(?=.*? .*?\/ ?>|br|hr|input|!--|wbr)[a-z]+.*?>|<([a-z]+).*?<\/\1>/i.test(htmlStringHere)

This searches for empty tags (some predefined) and / terminated XHTML empty tags and validates as HTML because of the empty tag OR will capture the tag name and attempt to find its closing tag somewhere in the string to validate as HTML.

Explained demo: http://regex101.com/r/cX0eP2

Update:

Complete validation with:

/<(br|basefont|hr|input|source|frame|param|area|meta|!--|col|link|option|base|img|wbr|!DOCTYPE).*?>|<(a|abbr|acronym|address|applet|article|aside|audio|b|bdi|bdo|big|blockquote|body|button|canvas|caption|center|cite|code|colgroup|command|datalist|dd|del|details|dfn|dialog|dir|div|dl|dt|em|embed|fieldset|figcaption|figure|font|footer|form|frameset|head|header|hgroup|h1|h2|h3|h4|h5|h6|html|i|iframe|ins|kbd|keygen|label|legend|li|map|mark|menu|meter|nav|noframes|noscript|object|ol|optgroup|output|p|pre|progress|q|rp|rt|ruby|s|samp|script|section|select|small|span|strike|strong|style|sub|summary|sup|table|tbody|td|textarea|tfoot|th|thead|time|title|tr|track|tt|u|ul|var|video).*?<\/\2>/i.test(htmlStringHere)

This does proper validation as it contains ALL HTML tags, empty ones first followed by the rest which need a closing tag.

Explained demo here: http://regex101.com/r/pE1mT5

edited Mar 17 '13 at 09:57

answered Mar 17 '13 at 09:29

CSᵠ

1

Just a note the bottom regex does work but it won't detect unclosed html tags such as "'hello world". granted this is broken html therefore should be treated as a string but for practical purposes your app may want to detect these too. – Sep 21 '16 at 19:39
1

HTML is designed with the forgiveness of user-agents in mind. "Invalid" tags are not invalid, they're just unknown, and permitted. "Invalid" attributes are not invalid… This is particularly notable when one begins to involve "web components" and technologies like JSX, which mix HTML and richer component descriptions, typically generating shadow DOM. Slap [this](https://paste.webcore.io/?%3C!DOCTYPE+html%3E%0A%3Ctitle%3Etesting%3C%2Ftitle%3E%0A%3Cp%3EThis+is+a+test.%3C%2Fp%3E%0A%3Cstrange%3EThis+is+strange.%3C%2Fstrange%3E) in a file and eval `document.querySelector('strange')` — it'll work. – amcgregor Feb 27 '20 at 16:46
(To summarize: due to how the specification is written, attempting to "validate" HTML markup is essentially a fool's errand. The link given to a sample HTML document with an "invalid" element, there, is a [100% fully-formed, complete HTML document](https://tomhodgins.hashnode.dev/code-that-you-just-never-ever-need-to-write-cjpblnfff00km0ys149kbttbg)—and has been since 1997—as another example.) – amcgregor Feb 27 '20 at 16:49

score 13 · Answer 5 · edited May 23 '17 at 12:02

zzzzBov's answer above is good, but it does not account for stray closing tags, like for example:

/<[a-z][\s\S]*>/i.test('foo </b> bar'); // false

A version that also catches closing tags could be this:

/<[a-z/][\s\S]*>/i.test('foo </b> bar'); // true

edited May 23 '17 at 12:02

Community

answered Aug 19 '14 at 10:24

AeonOfTime

Could have been better to suggest an edit, instead of posting this as a comment. – Zlatin Zlatev Sep 13 '16 at 09:27
I think you mean `<[a-z/][\s\S]*>` - note the slash in the first group. – Ryan Guill May 09 '17 at 17:38

score 8 · Answer 6 · answered Jul 13 '18 at 13:12

All of the answers here are over-inclusive, they just look for < followed by >. There is no perfect way to detect if a string is HTML, but you can do better.

Below we look for end tags and self-closing tags, which is much tighter and more accurate (the example is in Python):

import re
re_is_html = re.compile(r"(?:</[^<]+>)|(?:<[^<]+/>)")

And here it is in action:

# Correctly identified as not HTML (each prints None):
print(re_is_html.search("Hello, World"))
print(re_is_html.search("This is less than <, this is greater than >."))
print(re_is_html.search(" a < 3 && b > 3"))
print(re_is_html.search("<<Important Text>>"))
print(re_is_html.search("<a>"))

# Correctly identified as HTML (each prints a match object):
print(re_is_html.search("<a>Foo</a>"))
print(re_is_html.search("<input type='submit' value='Ok' />"))
print(re_is_html.search("<br/>"))

# We don't handle, but could with more tweaking (each prints None):
print(re_is_html.search("<br>"))
print(re_is_html.search("Foo &amp; bar"))
print(re_is_html.search("<input type='submit' value='Ok'>"))
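
Since the question is about JavaScript, here is a rough port of the Python pattern above. It is a direct transliteration rather than anything the answer's author posted, and `looksLikeHtml` is a hypothetical name.

```javascript
// JavaScript port of the Python pattern: a closing tag (</x>) or a
// self-closing tag (<x/>). No flags are needed for this check.
const looksLikeHtml = (str) => /<\/[^<]+>|<[^<]+\/>/.test(str);

const inputs = [
  "Hello, World",
  "This is less than <, this is greater than >.",
  "<<Important Text>>",
  "<a>",
  "<a>Foo</a>",
  "<input type='submit' value='Ok' />",
  "<br/>",
  "<br>",
];

for (const s of inputs) {
  console.log(JSON.stringify(s), "->", looksLikeHtml(s));
}
```

As in the Python version, an unclosed `<a>` or a bare `<br>` is not flagged, which is the trade-off for fewer false positives on comparison operators.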

nnnnnn · Answer 7 · 2013-03-17T08:43:09.900

If you're creating a regex from a string literal you need to escape any backslashes:

var htmlRegex = new RegExp("<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>");
// extra backslash added here ---------------------^ and here -----^

This is not necessary if you use a regex literal, but then you need to escape forward slashes:

var htmlRegex = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/;
// forward slash escaped here ------------------------^
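
To see the escaping issue in isolation, here is a small sketch. It also shows `String.raw` as a third option that avoids doubling the backslashes; that option is an addition of this example, not something the answer proposes, and the variable names are made up.

```javascript
// In a string literal, "\b" is a single backspace character (U+0008), not
// backslash + b, so the regex engine never sees a word-boundary assertion.
console.log("\b" === "\u0008"); // true
console.log("\\b".length);      // 2: a backslash followed by "b"

// String.raw keeps backslashes as written, so no doubling is needed.
const pattern = String.raw`<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>`;
const fromRaw = new RegExp(pattern);
const fromLiteral = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/;

for (const s of ["<p>text</p>", "<p>text</b>", "plain text"]) {
  console.log(JSON.stringify(s), fromRaw.test(s), fromLiteral.test(s));
}
```

Both regexes should agree on every input, matching only the string whose closing tag has the same name as its opening tag.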

Also your jsfiddle didn't work because you assigned an onload handler inside another onload handler - the default as set in the Frameworks & Extensions panel on the left is to wrap the JS in an onload. Change that to a nowrap option and fix the string literal escaping and it "works" (within the constraints everybody has pointed out in comments): http://jsfiddle.net/wFWtc/4/

~~As far as I know JavaScript regular expressions don't have back-references. So this part of your expression, `</\1>`, won't work in JS (but would work in some other languages).~~

There is: https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/RegExp — nhahtdh, Mar 17 '13 at 08:34
Well, this will test that one of the tags looks OK, but nothing about the rest. Not sure what sort of "validity" that OP wants. — nhahtdh, Mar 17 '13 at 08:39

gtournie · Answer 8 · 2016-05-19T19:08:33.903

4

With jQuery:

function isHTML(str) {
  return /^<.*?>$/.test(str) && !!$(str)[0];
}

edited May 19 '16 at 19:08

answered Nov 19 '13 at 14:07

gtournie

2

`isHTML("");` // returns true `isHTML("div");` // returns true if there are `div`s on the page – ACK_stoverflow Jan 13 '14 at 19:46
@yekta - What are you talking about? This is supposed to check whether the string is html or not. An email is not an html tag as far as I know... isHTML('foo@bar.com') -> false // correct – gtournie May 19 '16 at 19:02
1

A string can be anything, if you know its an HTML tag then why check if its HTML in the first place, I don't quite follow your point. The `@` is not a valid syntax for a selector. Thus when you pass it to a jQuery selector, it will throw an exception (i.e. `$("you@example.com")` from `!!$(str)[0]`). I'm specifically referring to the `!!$(str)[0]` portion. You just edited your answer, but now you're checking for HTML before jQuery does anything. – yekta May 19 '16 at 20:42
I don't think the author wanted to check if it was just a string. That's the point. What he wanted was a function able to check if the string was a valid HTML **tag**, not just HTML (otherwise this is a bit stupid). I updated my answer after I read @ACK_stoverflow comment, but I'm sure a simple regex should do it. – gtournie May 19 '16 at 22:07

score 3 · Answer 9 · answered Feb 05 '16 at 04:09

/<\/?[^>]*>/.test(str) only detects whether the string contains tags; the string may just as well be XML.

answered Feb 05 '16 at 04:09

shinate

`27 is < 42, and 96 > 42.` This is not HTML. – amcgregor Aug 22 '20 at 02:56

score 3 · Answer 10 · answered Jun 11 '17 at 05:25

Using jQuery in this case, the simplest form would be:

if ($(testString).length > 0)

If $(testString).length is 1, this means that there is one HTML tag inside testString.

answered Jun 11 '17 at 05:25

Christo Peev

1

As per the answer just below (starting with "With jQuery", written four years prior to this one!), consider the poor choice of multiple uses from a single entry point. `$()` is a CSS selector operation. But also a DOM node factory from textual HTML serialization. But also… as per the other answer suffering from the same dependence on jQuery, "div" is not HTML, but that would return `true` if any `<div>` elements exist on the page. This is a very, very bad approach, as I have grown to expect with almost any solution needlessly involving jQuery. (Let it die.) – amcgregor Jun 24 '20 at 16:34

score 3 · Answer 11 · answered Jun 03 '21 at 12:53

While this is an old thread, I just wanted to share the solution I wrote for my needs:

function isHtml(input) {
    return /<[a-z]+\d?(\s+[\w-]+=("[^"]*"|'[^']*'))*\s*\/?>|&#?\w+;/i.test(input);
}

It should cover most of the tricky cases I've found in this thread. Tested on this page with document.body.innerText and document.body.innerHTML.

I hope it will be useful for someone. :)

answered Jun 03 '21 at 12:53

onestep.ua

Seems over-specific or attempting to more explicitly _validate_ the HTML. `` may be problematic as attributes without quoted values are perfectly acceptable, making this expression wrong. HTML is a permissive, not strict process, with [my answer](https://stackoverflow.com/a/60077957/211827) providing a more functionally complete (if eager) matching pattern. Test failures, not just success. – amcgregor Oct 05 '21 at 09:23
Worked perfect for me thanks! – Spencer Bigum Oct 20 '21 at 02:56

amcgregor · Answer 12 · 2020-05-15T15:44:11.350

There are fancy solutions involving utilizing the browser itself to attempt to parse the text, identifying if any DOM nodes were constructed, which will be… slow. Or regular expressions which will be faster, but… potentially inaccurate. There are also two very distinct questions arising from this problem:

Q1: Does a string contain HTML fragments?

Is the string part of an HTML document, containing HTML element markup or encoded entities? This can be used as an indicator that the string may require bleaching / sanitization or entity decoding:

/<\/?[a-z][^>]*>|(\&(?:[\w\d]+|#\d+|#x[a-f\d]+);)/

You can see this pattern in use against all of the examples from all existing answers at the time of this writing, plus some… rather hideous WYSIWYG- or Word-generated sample text and a variety of character entity references.
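
A runnable JavaScript form of the Q1 pattern might look like the sketch below. It is a reading of the pattern rather than the author's exact expression: the `/` inside the literal is escaped as JavaScript requires, the capturing group around the entity alternative is omitted, and the `i` flag is an assumption added here so that uppercase tags and hex digits also match. `htmlFragment` is a made-up name.

```javascript
// Sketch of the "contains HTML fragments" check: a tag-like token
// or a character entity reference (named, decimal, or hexadecimal).
const htmlFragment = /<\/?[a-z][^>]*>|&(?:[\w\d]+|#\d+|#x[a-f\d]+);/i;

const inputs = [
  "plain text",
  "27 is < 42, and 96 > 42.",
  "Foo &amp; bar",
  "&#x1F600;",
  "<strange>This is strange.</strange>",
];

for (const s of inputs) {
  console.log(JSON.stringify(s), "->", htmlFragment.test(s));
}
```

Unlike most patterns in this thread, this one also flags text that contains only entities, which matters if the goal is deciding whether entity decoding or sanitization is needed.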

Q2: Is the string an HTML document?

The HTML specification is shockingly loose as to what it considers an HTML document. Browsers go to extreme lengths to parse almost any garbage text as HTML. Two approaches: either just consider everything HTML (since if delivered with a text/html Content-Type, great effort will be expended to try to interpret it as HTML by the user-agent) or look for the prefix marker:

<!DOCTYPE html>

In terms of "well-formedness", that, and almost nothing else is "required". The following is a 100% complete, fully valid HTML document containing every HTML element you think is being omitted:

<!DOCTYPE html>
<title>Yes, really.</title>
<p>This is everything you need.

Yup. There are explicit rules on how to form "missing" elements such as <html>, <head>, and <body>. Though I find it rather amusing that SO's syntax highlighting failed to detect that properly without an explicit hint.
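
The prefix-marker approach for Q2 could be sketched as follows. The helper name `startsWithDoctype` and the decision to tolerate leading whitespace and any letter case are assumptions of this example, not requirements stated in the answer.

```javascript
// Hypothetical helper: true when the text starts with an HTML doctype,
// ignoring leading whitespace and letter case.
const startsWithDoctype = (str) => /^\s*<!doctype\s+html[\s>]/i.test(str);

const docs = [
  "<!DOCTYPE html>\n<title>Yes, really.</title>\n<p>This is everything you need.",
  "  <!doctype html><p>lowercase and indented",
  "<title>No doctype here</title>",
  "lorem ipsum",
];

for (const d of docs) {
  console.log(JSON.stringify(d.slice(0, 24)), "->", startsWithDoctype(d));
}
```

This also accepts older doctypes such as `<!DOCTYPE HTML PUBLIC ...>`, but documents without any doctype report false, which is the limitation raised in the comments on the accepted answer.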

score 1 · Answer 13 · answered Dec 11 '20 at 16:10

The original request does not say the solution has to be a RegExp, just that an attempt to use a RegExp was being made, so I will offer this up. It says something is HTML if a single child element can be parsed. Note, this will return false if the body contains only comments or CDATA or server directives.

const isHTML = (text) => {
  try {
    const fragment = new DOMParser().parseFromString(text,"text/html");
    return fragment.body.children.length>0
  } catch(error) { ; }  
  return false;
}

score 1 · Answer 14 · answered Dec 08 '22 at 05:56

A simple way to check is to use the function below as a utility:

const containsHTML = (str: string) => /<[a-z][\s\S]*>/i.test(str);

answered Dec 08 '22 at 05:56

Ilya Trofimov

score 0 · Answer 15 · answered Dec 08 '19 at 12:16

My solution is

const element = document.querySelector('.test_element');

const setHtml = elem =>{
    let getElemContent = elem.innerHTML;

    // Clean Up whitespace in the element
    // If you don't want to remove whitespace, then you can skip this line
    let newHtml = getElemContent.replace(/[\n\t ]+/g, " ");

    //RegEX to check HTML
    let checkHtml = /<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)<\/\1>/.test(getElemContent);

    //Check it is html or not
    if (checkHtml){
        console.log('This is an HTML');
        console.log(newHtml.trim());
    }
    else{
        console.log('This is a TEXT');
        console.log(elem.innerText.trim());
    }
}

setHtml(element);

Your regular expression [seems highly defective](https://regex101.com/r/6GOzkH/2) vs. [a more comprehensive expression](https://regex101.com/r/GIzRty/21), and requiring pre-processing (the initial replacement) is highly unfortunate. — amcgregor, May 15 '20 at 14:49

Herman Autore · Answer 16 · 2021-07-28T15:38:25.190

Here's a regex-less approach I used for my own project.

If you are trying to detect HTML strings among other non-HTML strings, you can parse the string with an HTML parser, extract the text back out, and see whether the result differs from the original.

An example Python implementation is as follows:

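from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

def isHTML(string):
    # Parse the input, then take only its text content back out.
    soup = BeautifulSoup(string, 'html.parser')  # Can use other HTML parser like etree
    # If the extracted text differs from the input, markup was present.
    return soup.text != string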

It worked on my sample of 2800 strings.

The pseudocode would be

define function "IS_HTML"
  input = STRING
  set a copy of STRING as STRING_1
  parse STRING using an HTML parser, extract its text, and set as STRING_2
  IF STRING_1 is not equal to STRING_2
  THEN RETURN TRUE
  ELSE RETURN FALSE

This worked for me in my test case, and it may work for you.

I guess you'd better mark somewhere that this is python solution (while question is about JS) — Nikita Popov, Jul 28 '21 at 14:55

niko · Answer 17 · 2021-05-12T14:39:23.433

I needed something similar for xml strings. I'll put what I came up with here in case it might be useful to anyone..

static isXMLstring(input: string): boolean {
    const reOpenFull = new RegExp(/^<[^<>\/]+>.*/);
    const reOpen = new RegExp(/^<[^<>\/]+>/);
    const reCloseFull = new RegExp(/(^<\/[^<>\/]+>.*)|(^<[^<>\/]+\/>.*)/);
    const reClose = new RegExp(/(^<\/[^<>\/]+>)|(^<[^<>\/]+\/>)/);
    const reContentFull = new RegExp(/^[^<>\/]+.*/);
    const reContent = new RegExp(/^[^<>&%]+/); // exclude reserved characters in content

    const tagStack: string[] = [];

    const getTag = (s: string, re: RegExp): string => {
      const res = (s.match(re) as string[])[0].replaceAll(/[\/<>]/g, "");
      return res.split(" ")[0];
    };

    const check = (s: string): boolean => {
      const leave = (s: string, re: RegExp): boolean => {
        const sTrimmed = s.replace(re, "");
        if (sTrimmed.length == 0) {
          return tagStack.length == 0;
        } else {
          return check(sTrimmed);
        }
      };

      if (reOpenFull.test(s)) {
        const openTag = getTag(s, reOpen);
        tagStack.push(openTag); // opening tag
        return leave(s, reOpen);
      } else if (reCloseFull.test(s)) {
        const openTag = tagStack.pop();
        const closeTag = getTag(s, reClose);
        if (openTag != closeTag) {
          return false;
        }
        // closing tag
        return leave(s, reClose);
      } else if (reContentFull.test(s)) {
        if (tagStack.length < 1) {
          return false;
        } else {
          return leave(s, reContent); // content
        }
      } else {
        return false;
      }
    };

    return check(input);
  }

score 0 · Answer 18 · answered Aug 30 '22 at 18:28

The most voted answer validates the following string as a HTML pattern when it obviously isn't:

true = (b<a || b>=a)

A better approach would be <([a-zA-Z]+)(\s*|>).*(>|\/\1>) which can be visualized here.

See also the HTML Standard for further information.

This pattern is not going to validate your HTML document but rather a HTML tag. Obviously there is still room for improvements, the more you improve it the sooner you get a very-huge-complex HTML validation pattern, something you would want to avoid.

Example:

<t>
<a >
<g/>
<tag />
<tag some='1' attributes=2 foo >...
<tag some attributes/>
<tag some attributes/>...</tagx>
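
A quick check in JavaScript (a sketch, not part of the answer) suggests that this pattern also matches the `(b<a || b>=a)` string it was meant to reject: `(\s*|>)` can match the empty string, and the final group accepts any later `>`. The name `tagPattern` is made up here.

```javascript
// The pattern proposed in this answer, written as a JavaScript regex literal.
const tagPattern = /<([a-zA-Z]+)(\s*|>).*(>|\/\1>)/;

const inputs = [
  "<t>",
  "<g/>",
  "<tag some='1' attributes=2 foo >...",
  "<tag some attributes/>...</tagx>",
  "true = (b<a || b>=a)", // the counterexample from the top of this answer
  "a < b",                // "<" followed by a space is not matched
];

for (const s of inputs) {
  console.log(JSON.stringify(s), "->", tagPattern.test(s));
}
```

So the pattern behaves much like "a `<`, some letters, and a `>` somewhere later on the same line", and it would need further tightening to exclude unspaced comparisons.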

score -1 · Answer 19 · answered May 13 '20 at 03:16

There is an NPM package is-html that can attempt to solve this https://github.com/sindresorhus/is-html

answered May 13 '20 at 03:16

Colin D

I do not [comprehend the expression it is attempting to use](https://regex101.com/r/JCgaRN/2) which fails except on the declared doctype, and the "full" pattern constructed from known HTML elements pulled in from an additional dependency ignores the fact that that's not how HTML works, and hasn't been for a very, very long time. Additionally, the base pattern explicitly mentions `` and `` tags, [both of which are entirely optional](https://tomhodgins.hashnode.dev/code-that-you-just-never-ever-need-to-write-cjpblnfff00km0ys149kbttbg). The "not match XML" test is telling. – amcgregor May 15 '20 at 14:54
@amcgregor if you think your solution is better maybe contribute to the isHTML repo? and add your suite of tests from regex101? it would be valuable to the community – Colin D May 18 '20 at 17:22
The fundamental purpose of that library is misguided and will inherently be wrong in a large number of cases, usually by false-flagging as not-HTML due to the presence of tags it does not understand; _validation_ can not succeed this way. Additionally, a simple regex or a (edit: [pair of](https://www.npmjs.com/package/html-tags)) librar[ies]… [we may have forgotten how to program](http://das.encs.concordia.ca/blog/why-do-developers-use-trivial-packages-npm/), and Node/NPM is not a language or toolchain I generally wish to utilize, contribute to, or encourage the use of. – amcgregor May 19 '20 at 19:06
Alright amcgergor, you are being pretty negative to me when I was just trying to help. I disagree with the premise of npm being misguided. Imagine your stack overflow answer came up with a small tweak in the future. I, as a developer using your library, would just upgrade, and I would get more proper behavior. Instead, I have to....live with the broken behavior or revisit this stack overflow answer to get your edits? That is the alternative universe – Colin D May 19 '20 at 20:18
Negative? I was explaining my stance and why I would not be doing what would otherwise seem a sensible thing. Note, however, that the article I linked was the follow-on from [a slightly more inflammatory first](https://www.davidhaney.io/npm-left-pad-have-we-forgotten-how-to-program/) (linked up-front) which generated plenty of discussion. [He published a technical paper](http://das.encs.concordia.ca/uploads/2017/07/Abdalkareem_FSE2017.pdf), also linked there, towards the bottom. I counter your gut feeling about re-work with evidence about quality. Ref: §7.2 (& the left-pad disaster & eslint) – amcgregor May 19 '20 at 20:28
This is tangential to the original issue now. Thanks for your perspective though. – Colin D May 19 '20 at 20:33
