1

I'm trying to remove all <br> before my text.

So I have this:

<p>
 <br/><br/>When the battle is on between contestants in a talent show, it gets really competitive when down to the last four.  X-FactorUSAcontestant Marcus Canty knows this all too well as this is the stage he was voted off of the show earlier this year. <br/><br/>
</p>

I want to get rid of the first two <br/>, and I'd also want to remove them if there were more than two.

I would prefer to use XPath as I'm already using it; at the moment I have this:

foreach($xpath->query('//br[not(preceding::text())]') as $node) {
    $node->parentNode->removeChild($node);
} 

For some reason on this particular page it doesn't seem to be working.
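
One possible cause, offered as a sketch rather than a confirmed diagnosis: `preceding::text()` also matches whitespace-only text nodes, so if the parser keeps the newline and space between `<p>` and the first `<br/>`, the predicate is false and nothing is removed. Filtering on `normalize-space()` ignores such nodes. The function name `removeLeadingBreaks` and the wrapper `div` with id `wrap` are hypothetical, introduced only for this illustration, and the sample text is shortened.

```php
<?php
// Hypothetical helper: removes <br> elements that come before any
// non-whitespace text, then returns only the fragment (no DOCTYPE/html/body).
function removeLeadingBreaks(string $html): string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    // The wrapper div is an assumption of this sketch; it makes the
    // fragment easy to find again after parsing.
    $doc->loadHTML('<div id="wrap">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $wrap = $xpath->query('//div[@id="wrap"]')->item(0);

    // Only text nodes with non-whitespace content count as "text before".
    $query = './/br[not(preceding::text()[normalize-space()])]';
    foreach ($xpath->query($query, $wrap) as $br) {
        $br->parentNode->removeChild($br);
    }

    // Serialize the children of the wrapper one by one, so the DOCTYPE
    // and the implied <html>/<body> tags are not part of the result.
    $out = '';
    foreach ($wrap->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$cleaned = removeLeadingBreaks("<p>\n <br/><br/>When the battle is on.<br/><br/>\n</p>");
echo $cleaned;
```

Note that `normalize-space()` does not treat a non-breaking space as whitespace, so content that starts with `&nbsp;` would still count as text in this sketch.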

UPDATE

Originally the question was why there was <br><br> at the start of the document when my XPath should be getting rid of them (see below). I applied some regex to see if that worked, which revealed the doctype you see now. I thought the doctype was somehow causing my original problem but it just wasn't being shown until now. This content is what I've imported from Blogger and am currently manipulating to fit a new blog.

link to example page

!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN” “http://www.w3.org/TR/REC-html40/loose.dtd”><br><br>

Here's my code:

global $post;
$postTime = $post->post_date;
$postTime = strtotime($postTime);
$startDate = "2014/01/16";
if ($postTime < strtotime($startDate)) {
    $html = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
    $doc = new DOMDocument();
    @$doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//br[not(preceding::text())]') as $node) {
        $node->parentNode->removeChild($node);
    }
    $nodes = $xpath->query('//a[string-length(.) = 0]');
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    $nodes = $xpath->query('//*[not(text() or node() or self::br)]');
    foreach ($nodes as $node) {
        $node->parentNode->removeChild($node);
    }
    remove_filter('the_content', 'wpautop');
    $content = $doc->saveHTML();
    $content = ltrim($content, '<br>');
    $content = strip_tags($content, '<br> <a> <iframe>');
    $content = preg_replace(array('/(<br\s*\/?>\s*){1,}/'), array('<br/><br/>'), $content);
    $content = str_replace('&nbsp;', ' ', $content);
    $content = "<p>".implode("</p>\n\n<p>", preg_split('/\n(?:\s*\n)+/', $content))."</p>";
    return $content;
}

Help appreciated.

Jens Erat
UzumakiDev
  • I think now the regex works: `<br[^>]*/>` actually states that whatever content between `<br` and `/>` is ignored...
    – Willem Van Onsem Jan 21 '14 at 14:09
  • [Do not parse \[x\]HTML with RegEx](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Jørgen R Jan 21 '14 at 14:26
  • @jurgemaister Oh look it's that regex post EVERYBODY links to when anybody is doing anything regex... – UzumakiDev Jan 21 '14 at 14:27
  • Not anything with regex. Only regex and HTML. Wonder why... – Jørgen R Jan 21 '14 at 14:28
  • @jurgemaister so what are you saying, is it my regex or my xpath that is causing my doctype to be printed on the page? – UzumakiDev Jan 21 '14 at 14:30
  • I'm not sure. Can you provide a sample document? – Jørgen R Jan 21 '14 at 14:36
  • @jurgemaister here's a link to a page http://lartmagazine.co.uk/music-update-with-marcus-canty-wale/ – UzumakiDev Jan 21 '14 at 14:39
  • I'm somewhat confused. There's a `!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN” “http://www.w3.org/TR/REC-html40/loose.dtd”>` inside the `.textContent` while the document itself has a HTML5 doctype. Why is it that? – Jørgen R Jan 21 '14 at 14:44
  • @jurgemaister I know, right? No idea. Originally the question was why there was `<br><br>` at the start of the document when my XPath should be getting rid of them (see above). I applied some regex to see if that worked, which revealed the doctype you see now. I thought the doctype was somehow causing my original problem but it just wasn't being shown until now. This content is what I've imported from Blogger and am currently manipulating to fit a new blog.
    – UzumakiDev Jan 21 '14 at 14:48

2 Answers

1

What about ltrim?

$string = ltrim($string, '<br/>');
Mario Radomanana
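
A caveat on `ltrim`, shown as a small sketch: its second argument is a set of characters to strip, not a prefix string. That can remove too much (letters `b` and `r` at the start of the text) or too little (only the `<` of a leading `<!DOCTYPE`). The latter looks consistent with the stray `!DOCTYPE ...>` text reported in the question, since `saveHTML()` called without a node serializes the whole document and `strip_tags` no longer sees a tag once the `<` is gone, though that is an inference rather than something confirmed in the thread. The helper name `stripLeadingBreaks` and the sample strings are hypothetical.

```php
<?php
// ltrim() strips any of the characters '<', 'b', 'r', '>' from the left,
// so the leading "r" of "robot" is removed along with the tags.
$stripped = ltrim('<br><br>robot text', '<br>');
var_dump($stripped); // "obot text"

// With a DOCTYPE in front, only its "<" is stripped; "!" stops the trim.
$doctype = ltrim('<!DOCTYPE html><br>text', '<br>');
var_dump($doctype); // "!DOCTYPE html><br>text"

// Hypothetical helper: removes only whole leading <br>, <br/> or <br />
// tags (plus surrounding whitespace) from the start of the string.
function stripLeadingBreaks(string $content): string
{
    // "~" is used as the delimiter so "/" needs no escaping.
    return preg_replace('~^(?:\s*<br\s*/?>)+\s*~i', '', $content);
}

$trimmed = stripLeadingBreaks("<br/><br />\nWhen the battle is on");
var_dump($trimmed); // "When the battle is on"
```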
0

You could try using a regex

s/!DOCTYPE html PUBLIC “-\/\/W3C\/\/DTD HTML 4.0 Transitional\/\/EN” “http:\/\/www.w3.org\/TR\/REC-html40\/loose.dtd”>((<br[^>]*\/>)+)(.*)/\3/

or in PHP:

$pattern = '/^((<br[^>]*\/>)+)(.*)/i'; // the "/" inside the pattern must be escaped
$replacement = '$3';
$content = preg_replace($pattern, $replacement, $content);
Willem Van Onsem
  • Thanks, but another issue has come up please see updated question. – UzumakiDev Jan 21 '14 at 13:57
  • What do you actually mean by wrapped? Do you mean the `
    ` tag is actually `
    `?
    – Willem Van Onsem Jan 21 '14 at 13:59
  • I mean now I'm getting `!DOCTYPE html PUBLIC “-//W3C//DTD HTML 4.0 Transitional//EN” “http://www.w3.org/TR/REC-html40/loose.dtd”>`
    – UzumakiDev Jan 21 '14 at 14:15
  • Right, it's what is being output on my page, if you look at the xpath stuff I'm doing somewhere along the line my doctype is being added to the content, I'm not sure why or how to get rid of it. That's why it's being echo'd on my page. – UzumakiDev Jan 21 '14 at 14:19
  • Well, you can get rid of it by inserting it in the regex as well. However, I'm wondering why the doctype is printed in the content. I don't see how the algorithm you present can do that, so probably there is something wrong with the server providing the original content. – Willem Van Onsem Jan 21 '14 at 14:22
  • Well yeah, originally I was wondering why my content was showing `<br><br>` before the text when my XPath shouldn't allow it, then the doctype appeared, which is what I think was causing the `<br><br>` to appear in the first place. Ahh, I'm just going to post another question.
    – UzumakiDev Jan 21 '14 at 14:25