I use XPath to remove untidy HTML tags:

$nodeList = $xpath->query("//*[normalize-space(.)='' and not(self::br)]");
foreach ($nodeList as $node)
{
    $node->parentNode->removeChild($node);
}

This removes messy input like the following:

<p><em><br /></em></p>
<p><span style="text-decoration: underline;"><em><br /></em></span></p>

but it also removes `img` tags like the one below, which I want to keep:

<p><img title="picture summit" src="images/32913430_127001_e.jpg" alt="picture summit" width="590" height="366" /></p>

How can I keep the `img` tags with XPath?

hakre
Run
  • Note that using a `br` element in a paragraph to force a line break without starting a new paragraph is perfectly valid. Do you want to remove empty paragraphs? If so, you will have to explicitly consider which elements you want and which you don't, for example keep `img` but filter out everything else. – Ludovic Kuty Oct 22 '11 at 16:15
  • Thanks. Yes, I want to remove empty paragraphs only... – Run Oct 22 '11 at 16:23
  • Good question, +1. Before even starting to write XPath expressions, it is a good idea to think and specify well exactly what elements inside a `p` make it "non-empty". – Dimitre Novatchev Oct 22 '11 at 17:50
  • Sibling Question: [Remove with XPATH](http://stackoverflow.com/q/7856414/367456) – hakre Jun 23 '13 at 22:36

2 Answers

1

Use:

//p[not(descendant::*[self::img or self::br]) and normalize-space()='']
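
As a rough illustration of what this expression selects, the sketch below runs it against markup adapted from the question. The helper name `countParagraphsAfterRemoval`, the wrapping `<div>`, and the extra sample paragraphs are assumptions made for the example. Because the predicate excludes paragraphs that have a `br` descendant, a paragraph such as `<p><em><br /></em></p>` is kept; the variant without `self::br` is included only for comparison and is not part of the answer.

```php
<?php
/**
 * Hypothetical helper: remove every node matched by $expr and
 * return how many <p> elements remain in the document.
 */
function countParagraphsAfterRemoval(string $html, string $expr): int
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    // Remove each matched node from its parent.
    foreach ($xpath->query($expr) as $node) {
        $node->parentNode->removeChild($node);
    }

    return $xpath->query('//p')->length;
}

// Sample markup adapted from the question (the wrapping <div> is an assumption).
$html = '<div>'
    . '<p></p>'
    . '<p><em><br /></em></p>'
    . '<p><img title="picture summit" src="images/32913430_127001_e.jpg" alt="picture summit" width="590" height="366" /></p>'
    . '<p>Some text</p>'
    . '</div>';

// Expression from the answer: keeps paragraphs containing img or br.
$answerExpr = "//p[not(descendant::*[self::img or self::br]) and normalize-space()='']";

// Assumed variant: also removes paragraphs whose only content is br.
$variantExpr = "//p[not(descendant::img) and normalize-space()='']";

echo countParagraphsAfterRemoval($html, $answerExpr), "\n";  // 3: only the empty <p> is removed
echo countParagraphsAfterRemoval($html, $variantExpr), "\n"; // 2: the br-only <p> is removed too
```

Which of the two behaviors is wanted depends on whether a paragraph holding only a `br` counts as "empty" for the input at hand.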
Dimitre Novatchev
  • Sorry, I got this error actually, `Warning: DOMXPath::query() [domxpath.query]: Invalid expression in C:\wamp\www\test\2011\php\tidy_html\dom_tidy_html_5.php on line 120` it refers to `//p[not(descendant::/*[self::img or self::br]) and normalize-space()='']`... – Run Oct 23 '11 at 13:27
  • I amended the expression and now it works with this `//p[not(descendant::*[self::img or self::br]) and normalize-space()='']` – Run Oct 23 '11 at 13:34
  • @lauthiamkok: Yes, it was a typo, thanks for noticing this -- I already edited my answer. – Dimitre Novatchev Oct 23 '11 at 14:39
0

Maybe you could use an XPath 1.0 expression like the one below to remove unwanted paragraphs:

//p[count(text())=0 and count(img)=0]
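
One thing worth checking with this expression is that `text()` and `img` look only at direct children of the `p`. The sketch below lists which sample paragraphs it selects; the `id` attributes, the helper name `matchedParagraphIds`, and the descendant-based variant are assumptions added for illustration.

```php
<?php
/**
 * Hypothetical helper: return the id attribute of every <p> that $expr selects.
 */
function matchedParagraphIds(string $html, string $expr): array
{
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);

    $ids = [];
    foreach ($xpath->query($expr) as $node) {
        $ids[] = $node->getAttribute('id');
    }
    return $ids;
}

// Sample paragraphs; the ids exist only so the matches can be named.
$sample = '<div>'
    . '<p id="empty"></p>'
    . '<p id="br-only"><em><br /></em></p>'
    . '<p id="img"><img src="images/32913430_127001_e.jpg" alt="picture summit" /></p>'
    . '<p id="text">Some text</p>'
    . '<p id="nested-text"><em>Some text</em></p>'
    . '</div>';

// Expression from the answer: tests direct children only.
$childExpr = '//p[count(text())=0 and count(img)=0]';

// Assumed variant: tests all descendants instead.
$descendantExpr = '//p[count(.//text())=0 and count(.//img)=0]';

print_r(matchedParagraphIds($sample, $childExpr));      // empty, br-only, nested-text
print_r(matchedParagraphIds($sample, $descendantExpr)); // empty, br-only
```

With the child-only test, a paragraph whose text sits inside an inline element such as `em` is also selected, so the descendant form may be closer to "remove empty paragraphs only" if the input contains such markup.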
Ludovic Kuty