I have a string like

<p begin="00:35:47.079" end="00:35:49.119" region="r8" style="s1">
    <span style="s2" tts:backgroundColor="black">Hello I am a fireman. Good morning</span>
    <br/>
    <span style="s2" tts:backgroundColor="black">Why do you </span>
    <span style="s9" tts:backgroundColor="black">insist on that?</span>
</p>

I am trying to output it like

Hello I am a fireman. Good morning
Why do you insist on that?

I've tried this, which ultimately writes the output to a file:

$xmlObject = simplexml_load_string($delivery, 'SimpleXmlElement', LIBXML_NOCDATA);
$xmlArray = json_decode(json_encode((array) $xmlObject), TRUE);
foreach ($xmlArray['body']['div']['p'] as $p_tag) {
    if (!is_string($p_tag['span'])) {
        $multiLine = '';
        foreach ($p_tag['span'] as $line) {
            if (is_string($line)) {
                $multiLine .= $line . "\n";
            }
        }

        $p_tag['span'] = $multiLine;
    }
}

foreach ($toPrint as $line) {
    if (!isset($line['begin'])) {
        continue;
    }
    $endSpace = '';
    if (!$shrunk) {
        $endSpace = '   ';
    }
    fwrite($fileOpen, "\n\n" . $line['begin'] . ' --> ' . $line['end'] . $endSpace . "\n" . $line['content']);
}

and then printing out $p_tag line for line, but it will of course produce

Hello I am a fireman. Good morning
Why do you 
insist on that?

From here, I've also tried

$value = $Dom->documentElement->nodeValue;
$lines = explode("\n", $value);
$lines = array_map('trim', $lines); // remove leading and trailing whitespace
$lines = array_filter($lines); // remove empty elements


foreach($lines as $line) {
    echo htmlentities($line);
}

But that produces something like

Hello I am a fireman.Good morningWhy do youinsist on that?

When I var_dump the $p_tag, it produces something like this

["span"]=>
array(3) {
  [0]=>
  string(34) "Hello I am a fireman. Good morning"
  [1]=>
  string(11) "Why do you "
  [2]=>
  string(15) "insist on that?"
}
["br"]=>
array(0) {
}

So the break ends up out of order, and I can't rely on it when looking at the XML object. The spans are grouped together and the breaks are stored separately, so there is no way to put the line breaks back where they were in the original string.
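
The grouping described above can be reproduced with a minimal sketch. It assumes a simplified fragment with the attributes left out, so it is an illustration rather than the real file; the variable names are hypothetical.

```php
<?php
// Simplified fragment (attributes omitted for illustration); the element
// order matches the sample: span, br, span, span.
$xml = '<p><span>Hello I am a fireman. Good morning</span><br/>'
     . '<span>Why do you </span><span>insist on that?</span></p>';

// Same round trip as in the question: SimpleXML -> JSON -> array.
$array = json_decode(json_encode(simplexml_load_string($xml)), true);

// Children are grouped by tag name, so all spans share one key and the
// position of <br/> relative to them is no longer recorded.
print_r($array);
```

Because the array is keyed by tag name, nothing in it records that the break sat between the first and second span.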

Michael Millar
  • So is the rule that new sentences must start on a new line? – ADyson Apr 07 '21 at 09:27
  • @ADyson Just anything separated by `<br/>` should be on new lines. It could be that three sentences are on one line, and five words within the same sentence are on separate lines.
    – Michael Millar Apr 07 '21 at 09:32
  • Are you trying to output this on the command line, or in a web page? Because of course web pages take no account of `\n`. Also your original `foreach` simply discards any information about `<br/>`; it only looks for spans. You need to modify it so it processes the line breaks too.
    – ADyson Apr 07 '21 at 09:33
  • @ADyson I've updated the question. It's being printed to a file. So you can see where I'm inserting the line breaks. – Michael Millar Apr 07 '21 at 09:47
  • @ADyson I've also updated a comment related to the line break, because it doesn't stay in the original place, due to it being in a new data structure. – Michael Millar Apr 07 '21 at 09:51
  • `So the break gets 'lost'` ...not so much lost, as ignored. – ADyson Apr 07 '21 at 09:57
  • P.S. the var_dump doesn't match the sample text, so it's a bit hard to follow. I would have expected to see the `Hello I am a fireman` etc. in there, rather than `text here, text here`. If you're going to give samples, make sure they match up. – ADyson Apr 07 '21 at 09:58
  • @ADyson I've updated it to clear those up. – Michael Millar Apr 07 '21 at 10:00

1 Answer

You can make use of DOMDocument, searching for the paragraph tags and looping over the nodes inside each one.

$string='<p begin="00:35:47.079" end="00:35:49.119" region="r8" style="s1">
    <span style="s2" tts:backgroundColor="black">Hello I am a fireman. Good morning</span>
    <br/>
    <span style="s2" tts:backgroundColor="black">Why do you </span>
    <span style="s9" tts:backgroundColor="black">insist on that?</span>
</p>';
$result = [];

$doc = new DOMDocument();
$doc->loadHTML($string);

// get all paragraphs by <p>
$par_tag = $doc->getElementsByTagName('p');

// loop over all found paragraphs
foreach ($par_tag as $par) {
    // loop over the child nodes inside the paragraph
    foreach ($par->childNodes as $child) {
        // get the node name of the element
        $tag = $child->nodeName;

        if ($tag === 'span') {
            // if it is <span>: get the text
            $result[] = $child->nodeValue;
        } elseif ($tag === 'br') {
            // if it is <br>: add a linefeed
            $result[] = "\n";
        }
    }
}

echo implode('', $result);
Michel
  • Thanks, that's almost it there. I just had to add a `$result[]="\n";` at the bottom of the outer loop. I'll double check everything again though. Also, it required loadXML instead of loadHTML. – Michael Millar Apr 07 '21 at 10:44
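
The two adjustments mentioned in that last comment (`loadXML` instead of `loadHTML`, and a line feed after each paragraph) might look like the sketch below. The wrapper elements, the second cue and the `xmlns:tts` declaration are assumptions made for illustration, since the full file is not shown in the question.

```php
<?php
// Hypothetical cut-down document: the <tt>/<body>/<div> wrappers, the second
// <p> and the namespace declaration are assumed, not taken from the real file.
$xml = <<<XML
<tt xmlns:tts="http://www.w3.org/ns/ttml#styling">
  <body>
    <div>
      <p begin="00:35:47.079" end="00:35:49.119">
        <span tts:backgroundColor="black">Hello I am a fireman. Good morning</span>
        <br/>
        <span tts:backgroundColor="black">Why do you </span>
        <span tts:backgroundColor="black">insist on that?</span>
      </p>
      <p begin="00:35:50.000" end="00:35:52.000">
        <span tts:backgroundColor="black">Second cue</span>
      </p>
    </div>
  </body>
</tt>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml); // XML parser, as suggested in the comment

$result = [];
foreach ($doc->getElementsByTagName('p') as $par) {
    foreach ($par->childNodes as $child) {
        if ($child->nodeName === 'span') {
            $result[] = $child->nodeValue; // text of the span
        } elseif ($child->nodeName === 'br') {
            $result[] = "\n";              // explicit line break
        }
    }
    $result[] = "\n"; // line feed at the bottom of the outer loop
}

$text = implode('', $result);
echo $text;
```

Whitespace-only text nodes between the elements are skipped because their node name is `#text`, so only the spans and breaks contribute to the output.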