invalid HTML rendering logic

Question

Nearly all browsers allow a certain amount of leeway when rendering invalid HTML. For example, they would render x < y as if it were written x &lt; y because it is "clear" that the < is intended as a literal character, not part of an HTML tag.

Where can I find that logic as a separate "cleanup" module? Such a module would convert x < y to x &lt; y.

What are you using it for? If you're rendering user content, it would be better to escape the whole thing and output it. If you're writing a rendering engine... Good luck. — Mike Caron, Aug 04 '10 at 17:52
I am rendering user content, but I want to retain certain "safe" tags. I'm already using a module that removes "unsafe" tags, but it's also removing invalid HTML that looks like an unrecognized tag. I want to "clean it up" before handing it over to the module. — JoelFan, Aug 04 '10 at 17:56

Vivin Paliath · Accepted Answer · 2010-08-04T19:35:11.443

Try looking at the source code for Tidy.

HTML before running through Tidy:

<html>

 <head>
  <title>boo</title>
 </head>

 <body>
   x < y
 </body>

</html>

Same HTML after running through Tidy:

<html>
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">

  <title>boo</title>
</head>

<body>
  x &lt; y
</body>
</html>

Notice that x < y was changed to x &lt; y.

UPDATE

Based on your comment, you should probably use Tidy to clean up your HTML. I believe there are Tidy libraries for most common languages. If you are using PHP, there is PHP Tidy.

UPDATE

I noticed that you said you're using C#. You can use Tidy with C# as well. Here's something I found. I don't develop in C# and I haven't tried this out so YMMV:

Fix Up Your HTML with HTML Tidy and .NET

score 0 · Answer 2 · answered Aug 04 '10 at 17:55

Not sure what you mean exactly, but maybe the PHP function htmlentities could help you.

aletzo

No... see my response to @Mike Caron's comment – JoelFan Aug 04 '10 at 17:57

score 0 · Answer 3 · answered Aug 04 '10 at 18:00

Rendering of invalid HTML in browsers is horrible guesswork, and you really shouldn't try to emulate it (it will break). However, replacing some occurrences could be done with a regexp:

preg_replace('/(\s)<(\s)/', '$1&lt;$2', $data);

You

This will change ` < body>` to ` &lt; body>`. Undesirable. – Vivin Paliath Aug 04 '10 at 18:01
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Chuck Aug 04 '10 at 18:29
@Vivin: It is. It relies to a certain extent on users formatting their input properly, but it's fairly good. @Chuck: We're not actually parsing HTML here, but yeah. – You Aug 04 '10 at 19:43
I tend to be more paranoid :) – Vivin Paliath Aug 04 '10 at 20:02

score 0 · Answer 4 · answered Aug 04 '10 at 18:09

The HTML 5 (draft) specification includes a detailed parsing algorithm based on how browsers handle bad markup.
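One small piece of that algorithm is enough for the x < y case. As I read the tokenizer's "tag open" rules, a < only begins markup when the next character is an ASCII letter, "/", "!" or "?"; any other < is reported as a parse error and emitted as a literal character. Here is a minimal sketch of just that rule in Python. The names STRAY_LT and escape_stray_lt are made up for illustration, and this is nowhere near a full implementation of the spec's parser:

```python
import re

# A "<" can only start markup when followed by an ASCII letter, "/", "!" or "?".
# Any other "<" (including one at the very end of the input) is plain text.
STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(html: str) -> str:
    """Escape every "<" that cannot begin a tag, comment or declaration."""
    return STRAY_LT.sub("&lt;", html)


print(escape_stray_lt("<b>x</b> < y"))
```

Note that input like x <y (no space after the <) is left alone, because there the < does look like the start of a tag, which is also how a browser would read it.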

Quentin

score -1 · Answer 5 · answered Aug 04 '10 at 18:02

Edit: I am assuming you're using PHP, since you didn't specify

Use strip_tags:

$content = strip_tags($content, '<b><i>');

This will leave safe tags (as defined by you), and remove everything else.
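If the stripper is built on a tolerant parser, text like x < y does not have to be mistaken for an unknown tag. As a rough sketch of that idea (in Python rather than PHP, using the standard library's html.parser), the code below keeps whitelisted tags, drops all others, and re-escapes everything the parser reports as text. ALLOWED_TAGS, WhitelistSanitizer and sanitize are hypothetical names, and this is an illustration only, not a vetted security sanitizer:

```python
from html import escape
from html.parser import HTMLParser

ALLOWED_TAGS = {"b", "i"}  # hypothetical whitelist, mirroring '<b><i>' above


class WhitelistSanitizer(HTMLParser):
    """Keep whitelisted tags, drop other tags, escape all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are dropped on purpose

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # A stray "<" arrives here as text, so it gets escaped, not stripped.
        self.parts.append(escape(data, quote=False))


def sanitize(html_text):
    parser = WhitelistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.parts)


print(sanitize("<b>x</b> < <em>y</em>"))
```

The text inside a dropped tag is kept (escaped); only the tag itself is removed. Whether that is the right policy depends on the application.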

Mike Caron

I'm not using PHP, but I'm using something similar to strip_tags in C#. The problem is that my "strip_tags" thinks that "x < y" contains an unknown (and unterminated) tag called "y" and it "strips" it, leaving just "x" – JoelFan Aug 04 '10 at 18:18
@David It's the most common web development language. And, everyone else assumed that too. The onus is on the OP to specify, right? – Mike Caron Aug 04 '10 at 20:02
@Joel Ah, in that case, I'd go with someone else's answer. Vivin's is the only one with a C# answer, so... yeah. – Mike Caron Aug 04 '10 at 20:03
@David, PHP is the most common language. OP should specify or at least tag his question, otherwise you need to make these assumptions. – You Aug 04 '10 at 20:24
