2

Nearly all browsers use a certain amount of leeway in rendering invalid HTML. For example, they would render x < y as if it were written x &lt; y because it is "clear" that the < is intended as a literal character, not part of an HTML tag.

Where can I find that logic as a separate "cleanup" module? Such a module would convert x < y to x &lt; y

JoelFan
  • What are you using it for? If you're rendering user content, it would be better to escape the whole thing and output it. If you're writing a rendering engine... Good luck. – Mike Caron Aug 04 '10 at 17:52
  • I am rendering user content, but I want to retain certain "safe" tags. I'm already using a module that removes "unsafe" tags, but it's also removing invalid HTML that looks like an unrecognized tag. I want to "clean it up" before handing it over to the module. – JoelFan Aug 04 '10 at 17:56
  • Check my answer, you can do this without any modules – Mike Caron Aug 04 '10 at 18:03

5 Answers

3

Try looking at the source code for Tidy.

HTML before running through Tidy:

<html>

 <head>
  <title>boo</title>
 </head>

 <body>
   x < y
 </body>

</html>

Same HTML after running through Tidy:

<html>
<head>
  <meta name="generator" content=
  "HTML Tidy for Linux (vers 25 March 2009), see www.w3.org">

  <title>boo</title>
</head>

<body>
  x &lt; y
</body>
</html>

Notice that x < y was changed to x &lt; y.

UPDATE

Based on your comment, you should probably use Tidy to clean up your HTML. I believe there are Tidy libraries for most common languages that will do this for you. If you are using PHP, there is PHP Tidy.

UPDATE

I noticed that you said you're using C#. You can use Tidy with C# as well. Here's something I found. I don't develop in C# and I haven't tried this out so YMMV:

Fix Up Your HTML with HTML Tidy and .NET

Vivin Paliath
0

Not sure what you mean exactly, but maybe the PHP function htmlentities could help you.

aletzo
0

Rendering of invalid HTML in browsers is horrible guesswork, and you really shouldn't try to emulate it (it will break). However, replacing some occurrences could be done with a regexp:

$data = preg_replace('/(\s)<(\s)/', '$1&lt;$2', $data);
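
For readers not on PHP, here is a rough Python sketch of the same idea. The function name `escape_spaced_lt` is made up for illustration. It uses lookarounds instead of capture groups, so the surrounding whitespace is not consumed by the match. With the capturing pattern, two adjacent cases such as `a < < b` share a space and, as far as I can tell, only the first `<` would be replaced.

```python
import re

# Hypothetical helper; the name is illustrative and not from any library.
# A '<' with whitespace on both sides is treated as a literal less-than sign.
_SPACED_LT = re.compile(r"(?<=\s)<(?=\s)")


def escape_spaced_lt(data: str) -> str:
    """Replace every whitespace-delimited '<' with the entity '&lt;'."""
    return _SPACED_LT.sub("&lt;", data)


if __name__ == "__main__":
    print(escape_spaced_lt("x < y"))
    print(escape_spaced_lt("a < < b"))
    print(escape_spaced_lt("<b>x</b> < y"))
```

Like the PHP one-liner, this only covers a `<` that has whitespace on both sides. Something like `1<2` or a `<` at the very start or end of the input is left alone.
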
You
0

The HTML 5 (draft) specification includes a detailed parsing algorithm based on how browsers handle bad markup.
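
To give a feel for one rule in that algorithm: after a `<`, the tokenizer only starts a tag, comment or declaration if the next character is an ASCII letter, `/`, `!` or `?`. Otherwise the `<` is emitted as plain text. The Python sketch below approximates just that one rule with a regex. The name `escape_stray_lt` is an invention for this example, and the sketch knowingly ignores context such as script blocks, comments and attribute values, so treat it as a pre-filter and not as a parser.

```python
import re

# A '<' followed by a letter, '/', '!' or '?' may open a tag, end tag,
# comment or declaration. Any other '<' (including one at end of input)
# is treated here as literal text. This is an approximation of a single
# tokenizer rule, not an implementation of the parsing algorithm.
_STRAY_LT = re.compile(r"<(?![A-Za-z/!?])")


def escape_stray_lt(text: str) -> str:
    """Escape each '<' that cannot begin markup under the rule above."""
    return _STRAY_LT.sub("&lt;", text)


if __name__ == "__main__":
    for sample in ("x < y", "1<2", "<b>bold</b> < <i>it</i>", "x <y"):
        print(repr(sample), "->", repr(escape_stray_lt(sample)))
```

Note that `x <y` is left unchanged, because `<y` does look like the start of a tag under this rule. Input like that still needs a real parser or a whitelist of known tag names.
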

Quentin
-1

Edit: I am assuming you're using PHP, since you didn't specify.

Use strip_tags:

$content = strip_tags($content, '<b><i>');

This will leave safe tags (as defined by you), and remove everything else.

Mike Caron
  • I'm not using PHP, but I'm using something similar to strip_tags in C#. The problem is that my "strip_tags" thinks that "x < y" contains an unknown (and unterminated) tag called "y" and it "strips" it, leaving just "x" – JoelFan Aug 04 '10 at 18:18
  • @David It's the most common web development language. And, everyone else assumed that too. The onus is on the OP to specify, right? – Mike Caron Aug 04 '10 at 20:02
  • @Joel Ah, in that case, I'd go with someone else's answer. Vivin's is the only one with a C# answer, so... yeah. – Mike Caron Aug 04 '10 at 20:03
  • @David, PHP is the most common language. OP should specify or at least tag his question, otherwise you need to make these assumptions. – You Aug 04 '10 at 20:24