1

I have some data that is provided to me as $data. An example of some of the data is...

<div class="widget_output">
<div id="test1">
    Some Content
</div>
    <ul>
        <li>
            <p>
                <div>768hh</div>
                <div>2308d</div>
                <div>237ds</div>
                <div>23ljk</div>
            </p>
       </li>
        <div id="temp3">
            Some more content
        </div>
       <li>
            <p>
                <div>lkgh322</div>
                <div>32khhg</div>
                <div>987dhgk</div>
                <div>23lkjh</div>
            </p>
        </li>
</div>

I am attempting to change the invalid HTML DIVs inside the paragraphs so I end up with this instead...

   <div class="widget_output">
<div id="test1">
    Some Content
</div>
    <ul>
        <li>
            <p>
                <span>768hh</span>
                <span>2308d</span>
                <span>237ds</span>
                <span>23ljk</span>
            </p>
       </li>
        <div id="temp3">
            Some more content
        </div>
       <li>
            <p>
                <span>lkgh322</span>
                <span>32khhg</span>
                <span>987dhgk</span>
                <span>23lkjh</span>
            </p>
        </li>
</div>

I am trying to do this using str_replace with something like...

$data = str_replace('<div>', '<span>', $data);
$data = str_replace('</div>', '</span>', $data);

Is there a way I can combine these two statements and also make it so that they only affect the 'This is a random item' and not the other occurrences?

fightstarr20
  • Not sure if it can handle this specific case, but you might want to look into a library such as [HTML Purifier](http://htmlpurifier.org/) which is designed to (among other things) convert untrusted (e.g., user-input) HTML into standards-compliant markup. –  Jul 29 '12 at 22:07
  • Will the errant text always start with "This is a random item", or are you trying to match *any* `<div>` inside of a `<p>`? –  Jul 29 '12 at 22:22

3 Answers

4
$data = str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $data);

Since you didn't give any other details and only asked:

Is there a way I can combine these two statements and also make it so that they only affect the 'This is a random item' and not the other occurences?

Here you go:

$data = str_replace('<div>This is a random item</div>', '<span>This is a random item</span>', $data);
zerkms
  • This will get all `<div>` however. Note that only the innermost `<div>` have been converted to `<span>` – Michael Berkowski Jul 29 '12 at 21:55
  • @Frits van Campen: now it does ;-) – zerkms Jul 29 '12 at 21:58
  • @iblue In full agreement with zerkms here. There are lots of times when simple string operations are preferable to DOM manipulation, without resorting to regular expressions in any way. – Michael Berkowski Jul 29 '12 at 22:02
  • When I say 'This is a random item' this is just an example, the items are dynamically generated so are all different everytime. Is there a way I can cater for this fact? – fightstarr20 Jul 29 '12 at 22:02
  • @fightstarr20: how are we supposed to differ "these items" from "those items"? – zerkms Jul 29 '12 at 22:04
  • I have updated the original question to show a bit more clearly that the divs are random each time. Is there a way to say only change the `<div>` to `<span>` if the original `<div>` does not have an id or a class? – fightstarr20 Jul 29 '12 at 22:09
  • @fightstarr20: so they will always have "This is a random item" in their beginning? – zerkms Jul 29 '12 at 22:15
  • No, that is just there to make a point. They are completely random every time. But the thing that differentiates them between everything else is that the starting div for the random items has no class or id. – fightstarr20 Jul 29 '12 at 22:16
  • and now it becomes [parsing HTML with regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) :P – Vatev Jul 29 '12 at 22:17
  • @fightstarr20: then your case is not exception and you need to use DOM parser – zerkms Jul 29 '12 at 22:17
  • 1
    @Vatev: yep, it becomes it *only now* ;-) Before that it was a valid candidate for being "not overcomplicate" – zerkms Jul 29 '12 at 22:18
  • @fightstarr20: the opposite. Regex is the way you should not follow, but use DOM parser instead. – zerkms Jul 29 '12 at 22:19
  • Ok point taken, I have installed the simple PHP Dom Parser library now and I am trying to implement. Thanks for the input everyone – fightstarr20 Jul 29 '12 at 23:12
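The comments above settle on a DOM parser. As a minimal sketch of that idea, the snippet below uses PHP's built-in DOMDocument and DOMXPath rather than the Simple HTML DOM library mentioned in the last comment. The function name `renamePlainDivs` and the wrapper id `fragment-root` are made up for illustration. It selects `<div>` elements that carry no attributes at all, which covers the "no id and no class" criterion and avoids having to copy attributes across. One caveat: the HTML parser may restructure invalid nesting such as a `<div>` inside a `<p>`, so the serialized output can differ from the input in more than tag names.

```php
<?php
// Hypothetical helper: rename every attribute-less <div> to <span>,
// leaving <div> elements that have an id, class, or any other attribute alone.
function renamePlainDivs($html)
{
    $doc = new DOMDocument();
    // The sample markup is not valid HTML, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a container (the id is an arbitrary choice) so it
    // can be found again after loadHTML() adds <html> and <body>.
    $doc->loadHTML('<div id="fragment-root">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Copy the matches into an array before modifying the tree.
    $plainDivs = iterator_to_array($xpath->query('//div[not(@*)]'));
    foreach ($plainDivs as $div) {
        $span = $doc->createElement('span');
        // appendChild() moves each child out of the div and into the span.
        while ($div->firstChild) {
            $span->appendChild($div->firstChild);
        }
        $div->parentNode->replaceChild($span, $div);
    }

    // Serialize only the children of the container, not the wrapper itself.
    $root = $xpath->query('//div[@id="fragment-root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

echo renamePlainDivs('<div id="test1">Some Content</div><div>768hh</div>');
```

The sketch assumes ASCII or otherwise parser-friendly input; markup in another encoding would need extra handling before `loadHTML()`.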
1

You'll need to use a regular expression to do what you are looking to do, or to actually parse the string as XML and modify it that way. The XML parsing is almost surely the "safest," since as long as the string is valid XML, it will work in a predictable way. Regexes can at times fall prey to strings not being in exactly the expected format, but if your input is predictable enough, they can be ok. To do what you want with regular expressions, you'd do something like

$parsed_string = preg_replace("~<div>(?=This is a random item)(.*?)</div>~", "<span>$1</span>", $input_string);

What's happening here is the regex is looking for a <div> tag which is followed by (using a lookahead assertion) This is a random item. It then captures any text between that tag and the next </div> tag. Finally, it replaces the match with <span>, followed by the captured text from inside the div tags, followed by </span>. This will work fine on the example you posted, but will have problems if, for example, the <div> tag has a class attribute. If you are expecting things like that, either a more complex regular expression would be needed, or full XML parsing might be the best way to go.
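If the content of each item is arbitrary, one possible variation is to drop the lookahead and instead require that the `<div>` has no attributes and contains no further tags. The sketch below is only that, under those two assumptions; the helper name `plainDivsToSpans` is made up for illustration.

```php
<?php
// Hypothetical helper: convert <div>text</div> to <span>text</span> only when
// the opening tag has no attributes and the content contains no other tags.
function plainDivsToSpans($html)
{
    // [^<]* matches any run of characters up to the next tag, so a <div>
    // wrapping other elements (or one with an id or class) is left untouched.
    return preg_replace('~<div>([^<]*)</div>~', '<span>$1</span>', $html);
}

$data = '<div id="test1">Some Content</div><p><div>768hh</div><div>2308d</div></p>';
echo plainDivsToSpans($data);
```

Because the pattern matches the opening and closing tag together, the closing `</div>` of an element such as `<div id="test1">` is not rewritten, which is the problem the two separate str_replace calls run into.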

Michael Fenwick
  • The text 'This is a random item' was just in my example to make a point that it was random, I have edited the original post to make it a bit more obvious. – fightstarr20 Jul 29 '12 at 22:28
  • Is there some kind of wildcard maybe I can use in your example in place of the 'This is a random item'? – fightstarr20 Jul 29 '12 at 22:43
0

I'm a little surprised by the other answers. I thought someone would post a good one, but that hasn't happened. str_replace is not powerful enough in this case and regular expressions are hit-and-miss, so you need to write a parser.

You don't have to write a full HTML-parser, you can cheat a bit.

$in = '<div class="widget_output">
(..)
</div>';

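// Process the markup line by line, tracking whether we are inside a <p>.
$lines = explode("\n", $in);

$in_paragraph = false;
foreach ($lines as $nr => $line) {
    if (strpos($line, '<p>') !== false) {
        // An opening <p> switches replacement on for the following lines.
        $in_paragraph = true;
    } elseif (strpos($line, '</p>') !== false) {
        // A closing </p> switches it off again.
        $in_paragraph = false;
    } elseif ($in_paragraph) {
        // Only lines inside a paragraph have their div tags rewritten.
        $lines[$nr] = str_replace(array('<div>', '</div>'), array('<span>', '</span>'), $line);
    }
}
echo implode("\n", $lines);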

The critical part here is detecting whether you're in a paragraph or not. And only when you're in a paragraph, do the string replacement.

Note: I'm splitting on newlines (\n) which is not perfect, but works in this case. You might want to improve this part.
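One way to avoid depending on newlines, sketched here rather than offered as a finished parser, is to let preg_replace_callback find each `<p>...</p>` block and run the same str_replace only on the matched text. The helper name `divsToSpansInParagraphs` is made up for illustration, and the sketch assumes paragraphs are not nested, that the opening tag is a bare `<p>`, and that any `<div>` inside a paragraph has no attributes.

```php
<?php
// Hypothetical helper: rewrite <div>/</div> to <span>/</span>, but only in
// text that lies between a <p> and the next </p>.
function divsToSpansInParagraphs($html)
{
    return preg_replace_callback(
        // The s modifier lets . match newlines, so a paragraph may span lines.
        '~<p>.*?</p>~s',
        function ($match) {
            return str_replace(
                array('<div>', '</div>'),
                array('<span>', '</span>'),
                $match[0]
            );
        },
        $html
    );
}

$in = "<div id=\"temp3\">Some more content</div><p><div>lkgh322</div>\n<div>32khhg</div></p>";
echo divsToSpansInParagraphs($in);
```

The state tracking from the loop above is replaced by the regex match itself: everything outside a matched paragraph is passed through unchanged.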

Halcyon