
I have code with several lines like this:

<p> &lt;inset&gt;</p>

There may be any number of spaces or tabs (or none) between the opening <p> tag and the rest of the string. I need to replace these, but I can't get it to work.

I thought this would do it, but it doesn't work:

<p>[ \t]+&lt;inset&gt;</p>
hakre
artmem
  • Every time you regex some HTML, Alan Turing stomps on a kitten. – Marc B Feb 23 '12 at 18:30
  • @MarcB: Too funny. :-) There's nothing wrong with using regex on HTML when what you're wanting to do is very simple, though (like this). – FtDRbwLXw6 Feb 23 '12 at 18:35

3 Answers


Try this:

$html = preg_replace('#(<p>)\s+(&lt;inset&gt;</p>)#', '$1$2', $html);
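
If only spaces and tabs should be removed, as the question describes, a narrower variant might look like the sketch below. Note that `\s` in the line above also matches newlines. The sample input and the `$clean` variable are made up for illustration, and the use of `#` as the delimiter is just one way to avoid escaping the `/` in `</p>`.

```php
<?php
// Hypothetical sample input: a space, a tab plus spaces, and no whitespace at all.
$html = "<p> &lt;inset&gt;</p>\n<p>\t  &lt;inset&gt;</p>\n<p>&lt;inset&gt;</p>";

// '#' is the pattern delimiter, so the '/' in '</p>' needs no escaping.
// [ \t]* matches zero or more spaces or tabs and nothing else.
$clean = preg_replace('#(<p>)[ \t]*(&lt;inset&gt;</p>)#', '$1$2', $html);

echo $clean, "\n";
```

Using `*` instead of `+` means a line with no whitespace still matches and is simply left unchanged.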
hakre
FtDRbwLXw6

If you want true text trimming for HTML that copes with everything you can encounter (entities, comments, child elements and so on), you can use a TextRangeTrimmer together with a TextRange:

$htmlFragment = '<p> &lt;inset&gt;</p>';

$dom = new DOMDocument();
$dom->loadHTML($htmlFragment);
$parent = $dom->getElementsByTagName('body')->item(0);
if (!$parent)
{
    throw new Exception('Parent element not found.');
}

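// TextRange and TextRangeTrimmer are custom classes (see the gist linked below),
// not part of PHP's DOM extension.
$range = new TextRange($parent);
$trimmer = new TextRangeTrimmer($range);
$trimmer->ltrim();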

// inner HTML (PHP >= 5.3.6)
foreach($parent->childNodes as $node)
{
    echo $dom->saveHTML($node);
}

Output:

<p>&lt;inset&gt;</p>

Both classes are in a gist: https://gist.github.com/1894360/ (codepad viper is down).

hakre

Try to load your HTML string into a DOM tree instead, and then trim all the text values in the tree.

http://php.net/domdocument.loadhtml

http://php.net/trim
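
A minimal sketch of that idea, assuming only the leading whitespace of each `<p>` needs to go. Trimming every text node in the tree could also remove spaces that separate inline elements, so this version is deliberately narrower: it left-trims only the first text node of each paragraph. The sample fragment and the `$result` variable are made up for illustration.

```php
<?php
// Hypothetical sample fragment with a space, a tab and another space after <p>.
$html = "<p> \t &lt;inset&gt;</p>";

$dom = new DOMDocument();
$dom->loadHTML($html);

$result = '';
foreach ($dom->getElementsByTagName('p') as $p) {
    $first = $p->firstChild;
    // Only touch the paragraph if it starts with a text node.
    if ($first instanceof DOMText) {
        // Strip leading spaces and tabs from the decoded text.
        $first->data = ltrim($first->data, " \t");
    }
    // saveHTML() with a node argument needs PHP >= 5.3.6.
    $result .= $dom->saveHTML($p);
}

echo $result, "\n";
```

The entity `&lt;` is decoded to `<` while the text sits in the DOM and is escaped again when the node is serialized.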

penartur
  • That's more a comment than an answer. – hakre Feb 23 '12 at 18:39
  • OP said: "I need to replace these, but I can't get it to work", and my answer allows them to get rid of the extra whitespace characters as they want. As Marc B said, "Every time you regex some HTML, Alan Turing stomps on a kitten." – penartur Feb 23 '12 at 19:08