I need help doing a few things with XPath in PHP.

With any given HTML, I need to:

  • Remove all tables and their contents
  • Remove everything after the first h1 tag
  • Keep only paragraphs (INCLUDING their inner HTML (links, lists, etc))

With regex, I got everything working perfectly. When I encountered nested tables, however, I decided that it is indeed foolish to parse HTML with regex.

Thanks so much!

Dimitre Novatchev
Peter

2 Answers

With any given HTML, I need to:

• Remove all tables and their contents

• Remove everything after the first h1 tag

• Keep only paragraphs (INCLUDING their inner HTML (links, lists, etc))

This can be done very easily with XSLT:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:h="http://www.w3.org/1999/xhtml" >
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <!-- Copy every node except when overridden
      by another template -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

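 <!-- Remove all tables and their contents.
      An explicit priority is needed: the predicate
      patterns below default to 0.5 and would
      otherwise win over this one (default 0) -->
 <xsl:template match="h:table" priority="2"/>

 <!-- Remove everything after the first h1 -->
 <xsl:template match="node()[preceding::h:h1]" priority="1"/>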

 <!-- Keep only paragraphs (INCLUDING
      their inner HTML (links, lists, etc))
  -->
 <xsl:template match=
 "node()[not(self::h:p) and not(ancestor::h:p)]">
  <xsl:apply-templates/>
 </xsl:template>
</xsl:stylesheet>

If your element names are not in the XHTML namespace, simply delete every occurrence of h: in the above code.
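For PHP specifically, a minimal sketch of running such a stylesheet might look like the following. It assumes the dom and xsl extensions are enabled; the sample HTML and the variable names are made up for illustration. Because DOMDocument::loadHTML() produces elements in no namespace, the h: prefix is dropped here, and explicit priority attributes are set so that the table rule wins over the predicate-based rules.

```php
<?php
// Illustrative input only; any HTML string could be used here.
$html = <<<'HTML'
<html><body>
<p>Intro with a <a href="/x">link</a>.</p>
<table><tr><td><p>Inside a table</p></td></tr></table>
<h1>Title</h1>
<p>After the heading</p>
</body></html>
HTML;

// The stylesheet from this answer, without the h: prefix.
$xsl = <<<'XSL'
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="html"/>

 <!-- Identity rule: copy every node unless overridden -->
 <xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
 </xsl:template>

 <!-- Drop tables and everything inside them -->
 <xsl:template match="table" priority="2"/>

 <!-- Drop everything after the first h1 -->
 <xsl:template match="node()[preceding::h1]" priority="1"/>

 <!-- Unwrap anything that is not a paragraph or inside one -->
 <xsl:template match="node()[not(self::p) and not(ancestor::p)]">
  <xsl:apply-templates/>
 </xsl:template>
</xsl:stylesheet>
XSL;

// Parse the (possibly malformed) HTML; collect parser warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Load the stylesheet as XML and hand it to the XSLT processor.
$style = new DOMDocument();
$style->loadXML($xsl);

$proc = new XSLTProcessor();
$proc->importStylesheet($style);

// transformToXml() returns the serialized result as a string.
$result = $proc->transformToXml($doc);
echo $result;
```

With this input, only the first paragraph (including its link) should survive; the table, the heading, and the paragraph after the heading are dropped.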

Dimitre Novatchev
  • This is very nice. I'll have to read up on XSLT. How do I incorporate these solutions with PHP? Is it similar to using XPath queries? – Peter Dec 31 '10 at 19:17
  • @Peter: I am not using PHP, but AFAIK PHP uses the LibXml/LibXslt processor. Just search for it on the Internet and SO -- there should be many examples. – Dimitre Novatchev Dec 31 '10 at 19:23
Consider using an HTML DOM parser, as this will be much easier than treating the page as XML. Some of these parsers support XPath statements as well. The tricky part is that not all HTML conforms to strict XHTML standards, so the rules are not always easy to apply. Here are a couple of DOM parsers I came across. Some support XPath and some have other ways of selecting content:

http://simplehtmldom.sourceforge.net/

http://php.net/manual/en/simplexmlelement.xpath.php
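As a rough sketch of the DOM route using only PHP's bundled DOMDocument and DOMXPath classes (not the libraries linked above), the three requirements can be folded into one XPath query. Rather than deleting nodes, this variant collects the paragraphs that survive. The sample HTML and variable names are assumptions for illustration.

```php
<?php
// Illustrative input only.
$html = <<<'HTML'
<html><body>
<p>Intro with a <a href="/x">link</a>.</p>
<table><tr><td><p>Inside a table</p></td></tr></table>
<h1>Title</h1>
<p>After the heading</p>
</body></html>
HTML;

// loadHTML() tolerates non-XHTML markup; collect its warnings quietly.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Paragraphs that are not inside a table and do not follow an h1.
$query = '//p[not(ancestor::table) and not(preceding::h1)]';

$kept = '';
foreach ($xpath->query($query) as $p) {
    // saveHTML($node) serializes the node with its inner HTML intact.
    $kept .= $doc->saveHTML($p) . "\n";
}

echo $kept;
```

Selecting what to keep avoids mutating the tree while iterating over it, which is a common source of skipped nodes when removing elements in a loop.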

spinon
  • Suggested third party alternatives to [SimpleHtmlDom](http://simplehtmldom.sourceforge.net/) that actually use [DOM](http://php.net/manual/en/book.dom.php) instead of String Parsing: [phpQuery](http://code.google.com/p/phpquery/), [Zend_Dom](http://framework.zend.com/manual/en/zend.dom.html), [QueryPath](http://querypath.org/) and [FluentDom](http://www.fluentdom.org). – Gordon Jan 04 '11 at 09:43