How to keep `<img>` with XPATH?

Question

I use XPATH to remove untidy HTML tags:

$nodeList = $xpath->query("//*[normalize-space(.)='' and not(self::br)]");
foreach ($nodeList as $node) {
    $node->parentNode->removeChild($node);
}

This removes messy input such as:

<p><em><br /></em></p>
<p><span style="text-decoration: underline;"><em><br /></em></span></p>

but it also removes an `img` tag like the one below, which I want to keep:

<p><img title="picture summit" src="images/32913430_127001_e.jpg" alt="picture summit" width="590" height="366" /></p>

How can I keep the `img` tag with XPATH?

Note that using the element `br` in a paragraph to force a line break without starting a new paragraph is perfectly valid. Do you want to remove empty paragraphs? If so, you will have to explicitly consider which elements you want and which you don't, for example keep `img` but filter out anything else. – Ludovic Kuty, Oct 22 '11 at 16:15
Good question, +1. Before even starting to write XPath expressions, it is a good idea to think about and specify exactly which elements inside a `p` make it "non-empty". – Dimitre Novatchev, Oct 22 '11 at 17:50
Sibling Question: [Remove with XPATH](http://stackoverflow.com/q/7856414/367456) – hakre, Jun 23 '13 at 22:36

Dimitre Novatchev · Accepted Answer · 2011-10-23T14:38:29.310

1

Use:

//p[not(descendant::*[self::img or self::br]) and normalize-space()='']
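Why this helps: an `img` element has no text content, so `normalize-space(.)=''` is true both for the `img` and for the `p` that wraps it, and the expression in the question selects them. The expression above only selects `p` elements and skips any that contain an `img` or `br` descendant. Here is a minimal sketch of how it might be applied, assuming the markup is loaded with `DOMDocument::loadHTML`; the sample `$html` string is made up for illustration.

```php
<?php
// Illustrative sample: an empty paragraph, one holding only a <br>,
// one holding an image, and one holding text.
$html = '<div>'
      . '<p><em></em></p>'
      . '<p><em><br /></em></p>'
      . '<p><img src="images/32913430_127001_e.jpg" alt="picture summit" /></p>'
      . '<p>Some text</p>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Select paragraphs with no img/br descendant and no visible text.
$nodeList = $xpath->query("//p[not(descendant::*[self::img or self::br]) and normalize-space()='']");

// Iterate over a defensive copy of the result while removing nodes.
foreach (iterator_to_array($nodeList) as $node) {
    $node->parentNode->removeChild($node);
}

$remaining = $xpath->query('//p')->length;
echo $remaining, "\n"; // prints 3: only the first paragraph was removed
```

Note that a paragraph containing only a `br` is kept by this expression. If such paragraphs should be removed as well, as in the question's examples, one option would be to drop `or self::br` from the predicate.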

edited Oct 23 '11 at 14:38

answered Oct 22 '11 at 17:48

Dimitre Novatchev


Sorry, I got this error actually, `Warning: DOMXPath::query() [domxpath.query]: Invalid expression in C:\wamp\www\test\2011\php\tidy_html\dom_tidy_html_5.php on line 120` it refers to `//p[not(descendant::/*[self::img or self::br]) and normalize-space()='']`... – Run Oct 23 '11 at 13:27
I amended the expression and now it works with this `//p[not(descendant::*[self::img or self::br]) and normalize-space()='']` – Run Oct 23 '11 at 13:34
@lauthiamkok: Yes, it was a typo, thanks for noticing this -- I already edited my answer. – Dimitre Novatchev Oct 23 '11 at 14:39
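For reference, a malformed expression like the one in the warning above can be detected in code, because `DOMXPath::query()` returns `false` when the expression is invalid. A small sketch, with a made-up one-paragraph document and the warning suppressed with `@` purely to keep the output clean:

```php
<?php
$dom = new DOMDocument();
$dom->loadHTML('<p>text</p>');
$xpath = new DOMXPath($dom);

// "descendant::/*" is the typo from the first version of the answer.
$result = @$xpath->query("//p[not(descendant::/*[self::img or self::br]) and normalize-space()='']");

if ($result === false) {
    echo "Invalid XPath expression\n";
}
```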

score 0 · Answer 2 · answered Oct 22 '11 at 17:08

Maybe you could use an XPath 1.0 expression like the one below to remove unwanted paragraphs:

//p[count(text())=0 and count(img)=0]
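One thing to keep in mind with this expression: `text()` and `img` inside the predicate are child steps, so they only look at direct children of the `p`. Paragraphs whose text or image is wrapped in another element would therefore be selected too. The sketch below illustrates this on made-up markup; the `id` values are hypothetical and exist only to label the output.

```php
<?php
$html = '<div>'
      . '<p id="direct-text">Some text</p>'
      . '<p id="nested-text"><em>Some text</em></p>'
      . '<p id="direct-img"><img src="a.jpg" alt="" /></p>'
      . '<p id="nested-img"><span><img src="a.jpg" alt="" /></span></p>'
      . '</div>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the ids of the paragraphs the expression would remove.
$matched = [];
foreach ($xpath->query('//p[count(text())=0 and count(img)=0]') as $p) {
    $matched[] = $p->getAttribute('id');
}

echo implode(', ', $matched), "\n"; // prints "nested-text, nested-img"
```

If nested content should count as well, the `descendant::` axis used in the accepted answer is one way to cover it.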

Ludovic Kuty
