3

I need an HTML SAX (not DOM!) parser for PHP that can process even invalid HTML. I need it to filter user-entered HTML (remove all tags and attributes except the allowed ones) and to truncate HTML content to a specified length.

Any ideas?

Daniel
  • Hi, I am currently searching for such a parser myself. I wonder if you are still using HTML SAX Parser, or if you've found something else? – aurora Mar 16 '12 at 14:04
  • Tidy is the only "general solution" for "invalid HTML code", and PHP has a *good built-in SAX* (!), see [my answer below](http://stackoverflow.com/a/17903058/287948). – Peter Krauss Jul 27 '13 at 22:25
  • See similar question: http://stackoverflow.com/q/15679103/287948 – Peter Krauss Jul 27 '13 at 23:16

4 Answers

4

SAX was made to process valid XML and fail on invalid markup. Processing invalid HTML markup requires keeping more state than SAX parsers typically keep.

I'm not aware of any SAX-like parser for HTML. Your best shot is to pass the HTML through tidy first and then use an XML parser, but this may defeat your purpose of using a SAX parser in the first place.

Artefacto
  • Even after tidy, the pieces of HTML won't be valid. They're like this: `some comment with bold text, italic text.` That's an invalid document for any XML parser: there's no root, and I don't want to mess around with wrapping the content in some root element. – Daniel May 30 '10 at 16:57
  • @Daniel Why do you need an event handler in the first place? If the HTML snippets are short, I see no compelling reason. – Artefacto May 30 '10 at 17:25
  • @Daniel Sorry, I meant an event driven API such as SAX. – Artefacto May 30 '10 at 19:16
  • Oh, I've already got an implementation using a SAX parser. It's very efficient and simple, but the problem is the SAX parser itself: it uses regexps to parse HTML :( – Daniel May 30 '10 at 20:30
  • @Daniel HTML parsing with regex => trouble – Artefacto May 30 '10 at 21:23
  • Agreed. That's why I'm looking for something better. – Daniel May 30 '10 at 23:48
1

Summarizing as two steps:

  1. Use Tidy to transform "free HTML" into "good XHTML".
  2. Use XML Parser to parse the XHTML as XML through a SAX API.

First use Tidy (!) to transform "free HTML" into XHTML (also when you cannot trust your "supposed XHTML"). See the cleanRepair method. It takes more time, but it works with big files (!). If the files are very big, raise the maximum execution time to a few minutes.
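The Tidy step could be sketched roughly as follows. This is only an illustration under assumptions: it requires the tidy extension to be loaded, the function name `htmlToXhtml` is made up for this example, and the synthetic `<root>` wrapper is one possible way to give a fragment a single document element before handing it to an XML parser.

```php
<?php
// Sketch only: assumes ext-tidy is installed. htmlToXhtml() is a
// hypothetical helper name, not part of any library.
function htmlToXhtml(string $html): string
{
    $tidy = new tidy();
    $tidy->parseString($html, [
        'output-xhtml'     => true, // emit XHTML instead of HTML
        'show-body-only'   => true, // return only the repaired fragment
        'numeric-entities' => true, // e.g. &nbsp; becomes &#160;, which XML parsers accept
        'wrap'             => 0,    // do not re-wrap long lines
    ], 'utf8');
    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}

if (extension_loaded('tidy')) {
    // Unclosed <i> and a bare <br> are repaired by Tidy.
    $fragment = htmlToXhtml('some <b>bold and <i>italic</b> text<br>');

    // Wrap the fragment so an XML parser sees one document element.
    $xml = '<root>' . $fragment . '</root>';
    echo $xml, PHP_EOL;
}
```

For whole files rather than strings, `tidy::repairFile()` takes a file name and returns the repaired markup as a string.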

Another option (for working with big files) is to cache your XHTML files once they have been checked or transformed into XHTML. See Tidy's repairFile method.

With a "trusted XHTML", use SAX. How do you use SAX with PHP?

Parse the XML with a standard SAX API, which in PHP is implemented by LibXML (see LibXML2 at xmlsoft.org). Its interface is PHP's XML Parser, which is close to the standard SAX API.
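As a rough illustration of the SAX-style interface, the sketch below uses PHP's XML Parser functions to keep only whitelisted tags and drop every attribute. The function name `filterXhtml` and the `$allowed` list are invented for this example, and the input is assumed to be well-formed XHTML already (for instance after Tidy); the `<root>` wrapper is again just an assumption to make a fragment parseable.

```php
<?php
// Sketch only: filterXhtml() and $allowed are hypothetical names.
function filterXhtml(string $xhtml, array $allowed): string
{
    $out = '';
    $parser = xml_parser_create('UTF-8');
    // Keep tag names as written instead of upper-casing them.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        // Start-tag event: emit the tag only if it is whitelisted.
        function ($p, string $name, array $attrs) use (&$out, $allowed) {
            if (in_array($name, $allowed, true)) {
                $out .= "<$name>"; // attributes are dropped on purpose
            }
        },
        // End-tag event.
        function ($p, string $name) use (&$out, $allowed) {
            if (in_array($name, $allowed, true)) {
                $out .= "</$name>";
            }
        }
    );
    xml_set_character_data_handler(
        $parser,
        // Text event: may fire several times per text run; re-escape each piece.
        function ($p, string $text) use (&$out) {
            $out .= htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
        }
    );

    // The synthetic <root> wrapper gives the fragment one document element.
    if (xml_parse($parser, "<root>$xhtml</root>", true) !== 1) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }

    return $out;
}

$clean = filterXhtml('some <b onclick="x()">bold</b>, <u>underlined</u> text', ['b', 'i']);
echo $clean, PHP_EOL; // prints: some <b>bold</b>, underlined text
```

Note that this sketch keeps the text inside disallowed elements and only removes their tags; a real filter would probably want to discard the contents of elements such as `script` entirely.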

Another way to use the "SAX of LibXML2" through a different interface (a PHP iterator instead of the traditional SAX interface) is XMLReader. See this explanation about "XMLReader use SAX".
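A minimal XMLReader sketch for the truncation use case might look like this. It is an assumption-laden illustration, not a finished implementation: `truncateXhtml` is a made-up name, the limit counts characters of text rather than markup, attributes are not copied, the mbstring extension is assumed to be available, and the input is assumed to be well-formed XHTML.

```php
<?php
// Sketch only: truncateXhtml() is a hypothetical helper.
function truncateXhtml(string $xhtml, int $limit): string
{
    $reader = new XMLReader();
    $reader->XML("<root>$xhtml</root>"); // synthetic wrapper for the fragment

    $out = '';
    $open = [];          // stack of currently open element names
    $remaining = $limit; // characters of text still allowed

    while ($remaining > 0 && $reader->read()) {
        if ($reader->depth === 0) {
            continue; // skip the <root> wrapper itself
        }
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                if ($reader->isEmptyElement) {
                    $out .= "<{$reader->name}/>";
                } else {
                    $out .= "<{$reader->name}>";
                    $open[] = $reader->name;
                }
                break;
            case XMLReader::END_ELEMENT:
                $out .= '</' . array_pop($open) . '>';
                break;
            case XMLReader::TEXT:
            case XMLReader::WHITESPACE:
            case XMLReader::SIGNIFICANT_WHITESPACE:
                $text = mb_substr($reader->value, 0, $remaining, 'UTF-8');
                $remaining -= mb_strlen($text, 'UTF-8');
                $out .= htmlspecialchars($text, ENT_NOQUOTES, 'UTF-8');
                break;
        }
    }
    $reader->close();

    // Close whatever was still open when the limit was reached.
    while ($open) {
        $out .= '</' . array_pop($open) . '>';
    }

    return $out;
}

echo truncateXhtml('some <b>bold and <i>italic</i></b> text', 12), PHP_EOL;
// prints: some <b>bold an</b>
```

Because the reader pulls one node at a time, the loop can stop as soon as the limit is reached without reading the rest of the document.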


Yes, the terms "SAX" and "SAX API" do not appear in the PHP manual (!!). See this old but good introduction.

Peter Krauss
1

Try to use HTML SAX Parser

Murad X
  • I've tried to use it; it can't handle embedded JS or complex styles because it's based on regexes. – Daniel Aug 09 '10 at 13:05
  • I used it to solve the problem that you are trying to solve. I filter user-generated content, cut JavaScript, tags, attributes. – Murad X Aug 10 '10 at 14:57
0

I'd suggest the PEAR package XML_HTMLSax: http://pear.php.net/package/XML_HTMLSax/redirected

dader