I have the following text:

<html>
<HEAD>
<style type="text/css">
.zwischenueberschrift_3, .zwischenueberschrift{
    color:red;
    font-weight: bold;
}
</style>
</HEAD>
<body>
<inhalt bez="text" sprache="DE">
    <objekt type="xhtml">
        <div class="text">Amet totam deleniti voluptate corporis wisi, donec alias, aspernatur enim leo, sunt, cursus sollicitudin pellentesque tortor rutrum rerum, magna tenetur aliquid ducimus 
        incidunt sociosqu, mollitia dicta totam gravida dignissi.</div>
        <div class="zwischenueberschrift_3">Nostra lobortis magni, eum dapibus nemo culpa augue ligula class! Magna ullamcorper! Reiciendis porttitor cum vitae. Beatae primis dictum adipisicing 
        deserunt duis posuere magna <br />
        </div>
        <div class="text">rovident viverra? Parturient egestas, condimentum! Mollitia anim? Ullam! Saepe dis, sem? Arcu esse class optio atque! Minima incidunt voluptatem porta eaque animi! Nibh dis, 
        tincidunt aliquet? Perferendis massa eiusmod eius? Aut, nobis! Explicabo penatib:<br />
            <ul>
                <li>reiciendis, laboris facilis bibendum, .<br />
                    <br />
                </li>
                <li>porro, facilis felis? Ridiculus dicta! Integer luctus laoreet rhoncus, habita max. 50 %  corrupti,  3.000 kg  habitant corrupti, . <br />
                    <br />
                </li>
                <li>tur quidem, eos consequat:<br />- Bis 23.09.2016 für Oktober, November, Dezember<br />- Bis 14.10.2016 für November, Dezember, Januar<br />- Bis 11.11.2016 für Dezember, Januar, 
                Februar<br />- Bis 09.12.2016 für Januar, Februar, März</li>
                <br />
                <li>ed imperdiet et phasellus adipiscing! Bibendum. Ad. Maiores pellentesque! Mauri.<br />
                    <br />
                </li>
                <li> Habitant dolore! Vestibulum! Conubia quaerat. Minima, nihil penatibus magna adipisci! Ultricies dignissim hic imperdiet. Tempus, distinctio.<br />
                    <br />
                </li>
                <li>e, quae beatae inceptos labore sunt excepturi id, neque saepe quae tellus. N bei 50 bis 80 % das 0,8-fache (11,2 Cent/kg), bei 20 bis 50% das 0,5-fache (7 Cent/kg) und bei 
                weniger als 20% tus dapibus occa.<br />
                    <br />
                </li>
                <li>la, platea. Reprehenderit   <br />
                    <br />
                </li>
                <li>celerisque convallis occaecat im Jahr 2017.  <br />
                    <br />
                </li>
                <li>Ddolore id? Ea sint? Netus quasi vulputate bis zum 23.09.2016 nutzen.</li>
            </ul>
        </div>
        <div class="zwischenueberschrift">litora magni assumenda! Magnis </div>
        <div class="text">llentesque consectetuer voluptatum purus ratione, temporibus deleniti eveniet ullamco eget nostrud? Sodales fusce. Nostrum culpa saepe quis penatibus accusantium? Sagittis 
        porttitor minima nunc ab fermentum incidunt class urna, tempor, ullamcorper quod beatae? Nostra cubilia felis? Mus pretium fames etiam, cras, velit nec quae, voluptates quas voluptas dis 
        inceptos porro dolorem ligula.t.<br /> <br />entium! Consectetuer tenetur, auctor wisi? Voluptatibus reiciendis unde convallis justo incidunt? Itaque leo? Mollit odio ultricies asperiores 
        ullamco parturient sociosqu reiciendis incidunt consequat. Ut est? Impedit pellentesque fringilla eligendi? Mi ear </div>
        <div class="zwischenueberschrift_3">tpat eros, maiores totam cupi<br />
        </div>
        <div class="text">Excepteur saepe occaecati elit. Ex mauris do porttitor? Convallis molestie, consectetuer culpa. Voluptatum dolor ipsum adipiscing, quia, laudantium mi totam. Beatae quae. 
        Praesent excepturi, nemo fringilla similique quisquam sapiente totam fermentum fuga arcu . </div>
</objekt>       
</inhalt>
…
</body>
</html>

All div elements in this text with the class zwischenueberschrift or zwischenueberschrift_3 should be replaced by span elements.
The CSS rule is only there to highlight the affected elements.

I have built a regular expression that finds these elements, but unfortunately I can't replace them :-(
Here is the expression:

<div\sclass\=\"zwischenueberschrift(\_\d|)\"\>[a-zA-Z0-9ÄÖÜäöüß\s\-\<\/\>\.\!\,\r\n\t]+\<\/div\>

What I built to replace the div element either works insufficiently or not at all :-(

$preg ='/<div\sclass\=\"zwischenueberschrift(\_\d|)\"\>[a-zA-Z0-9ÄÖÜäöüß\s\-\<\/\>\.\!\,\r\n\t]+\<\/div\>/';

if( preg_match($preg, $text )){
    $a2 = preg_match_all($preg, $text , $a1 );

    if( is_integer($a2) && intval( $a2) > 0 )
    {
        $sammeln = '';

        $arr_text_zum_ersetzten = array();
        $arr_preg = array();
        foreach( $a1[0] as $key => $value )
        {
            #echo "$key, ".strlen($value)."<br>";

            $text_zum_ersetzten='';
            $text_zum_ersetzten = str_replace("div", "span", $value);
            $text_zum_ersetzten = str_replace("zwischenueberschrift_3", "zwischenueberschrift", $text_zum_ersetzten);
            $text_zum_ersetzten = str_replace("<br />", "", $text_zum_ersetzten);           
            $arr_text_zum_ersetzten[] = $text_zum_ersetzten;            

            $value = str_replace("<", "\<", $value);
            $value = str_replace(">", "\>", $value);
            $value = str_replace("_", "\_", $value);
            $value = str_replace("/", "\/", $value);
            $value = str_replace('"', '\"', $value);
            $value = str_replace("=", "\=", $value);
            $value = str_replace("-", "\-", $value);
            $arr_preg[] = "/".$value."/";

            $sammeln.=$value."\n";
            $sammeln.=$text_zum_ersetzten."\n\n";           
        }
        preg_replace( $arr_preg, $arr_text_zum_ersetzten, $text);

        # Loging to testing, start
        $sammeln.="++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n\n";
        $text.="++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++\n\n";      
        error_log($sammeln, 3, "sammeln_22.log");
        error_log($text, 3, "text11.log");      
        # Loging to testing, ende
    }
}

Does anyone have an idea how to make this more reliable, if possible using only regular expressions?

greetings webuser

webuser57
  • You shouldn't/[can't use regex](https://stackoverflow.com/a/1758162/10400050) to parse HTML. You should use some kind of HTML parser, like [`DOMDocument`](http://php.net/manual/en/class.domdocument.php) if you want to use PHP. – Johan Oct 15 '18 at 07:16
  • Not by themselves, but it's possible. For [example](https://github.com/ArtisticPhoenix/MISC/blob/master/Lexers/HtmlMinifier.php), see this minifier and its [Sandbox](http://sandbox.onlinephpfunctions.com/code/7aec59ce8ee620ed70883acb9b44952e2bf36113), which can minify everything but a given HTML tag. I could write one for this post, but it's too late for me. Time for bed. – ArtisticPhoenix Oct 15 '18 at 07:30
  • _Possible_, maybe sort of kind of sometimes partially when the phase of the moon is right, but it's an extremely brittle method. This is definitely a job for either DOMDocument or some other similar library, not regex. – kungphu Oct 15 '18 at 07:32
  • What I describe above is not "brittle". It's a lexer/parser. The key here is `Not by themselves`: in combination with PHP and basic tokenization, I could solve this with a regex parser. – ArtisticPhoenix Oct 15 '18 at 07:34
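
As the comments suggest, a DOM parser sidesteps the matching problem altogether. The following is a minimal sketch using DOMDocument and DOMXPath, not code from the thread. The function name `replaceHeadingDivs` is hypothetical, the input is assumed to be a complete HTML document, and it is assumed to declare its charset (otherwise libxml may misread the umlauts).

```php
<?php
// Hypothetical helper, not part of the original post.
function replaceHeadingDivs(string $html): string
{
    $doc = new DOMDocument();

    // Custom tags such as <inhalt> and <objekt> trigger libxml warnings;
    // collect them internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $query = '//div[@class="zwischenueberschrift" or @class="zwischenueberschrift_3"]';

    // Copy the node list into an array so replacing nodes cannot disturb iteration.
    foreach (iterator_to_array($xpath->query($query)) as $div) {
        $span = $doc->createElement('span');

        // Carry over every attribute (here only "class") unchanged.
        foreach ($div->attributes as $attribute) {
            $span->setAttribute($attribute->nodeName, $attribute->nodeValue);
        }

        // Move (not copy) all child nodes from the div into the new span.
        while ($div->firstChild !== null) {
            $span->appendChild($div->firstChild);
        }

        $div->parentNode->replaceChild($span, $div);
    }

    return $doc->saveHTML();
}
```

Calling `echo replaceHeadingDivs($text);` with the document from the question would print the rewritten markup. Because the parser knows which end tag belongs to which element, nested divs need no special handling. Note that libxml normalizes the markup on output, for example `<br />` becomes `<br>`.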

2 Answers

This is the best I can do with the amount of time I have:

function parse($subject, $tokens)
{
    $types = array_keys($tokens);
    $patterns = [];
    $lexer_stream = [];
    $result = false;
    foreach ($tokens as $k=>$v){
        $patterns[] = "(?P<$k>$v)";
    }
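    // The "s" modifier lets "." match newlines too; without it, line breaks
    // match no token and are silently dropped from the output.
    $pattern = "/".implode('|', $patterns)."/is";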

    if (preg_match_all($pattern, $subject, $matches, PREG_OFFSET_CAPTURE)) {
        //print_r($matches);
        foreach ($matches[0] as $key => $value) {
            $match = [];
            foreach ($types as $type) {
                $match = $matches[$type][$key];
                if (is_array($match) && $match[1] != -1) {
                    break;
                }
            }
            $tok  = [
                'content' => $match[0],
                'type' => $type,
                'offset' => $match[1]
            ];
            $lexer_stream[] = $tok;
        }

        $result = parseTokens( $lexer_stream );
    }
    return $result;
}

function parseTokens( array &$lexer_stream ){
    $result = '';
    $nesting = [0];
    $index = 0;
    while($current = current($lexer_stream)){
        $content = $current['content'];
        $type = $current['type'];
        switch($type){
            case 'T_EOF': return $result;               
            case 'T_START_DIV':                   
                if(preg_match('/zwischenueberschrift/i', $content)){
                    ++$index;
                    if(!isset($nesting[$index])) $nesting[$index] = 0;
                    $content = str_replace('div', 'span', $content);
                }                      
                ++$nesting[$index];                   
            break;
            case 'T_END_DIV':
                  --$nesting[$index];
                 if(!$nesting[$index] && $index){
                     unset($nesting[$index]);
                     if(empty($nesting)) $nesting = [0];
                     --$index;
                     if($index < 0)$index=0;
                     $content = str_replace('div', 'span', $content);
                 }
            break;    

            case 'T_UNKNOWN':
            default:
                //print_r($current);
               //trigger_error("Unknown token $type value $content", E_USER_ERROR);
        }
        $result .= $content;
        next($lexer_stream);
    }
    // Fallback in case the stream ends without a T_EOF token.
    return $result;
}

/**
 * Tokens are given as "name" => "regex".
 * 
 * Order is important: the first pattern that matches wins.
 * 
 * @var array $tokens
 */
$tokens = [
    'T_EOF'             => '\Z',
    'T_START_DIV'      => '<\s*div[^>]*>',
    'T_END_DIV'        => '<\s*\/\s*div[^>]*>',
    'T_UNKNOWN'         => '.+?'
];

$html = <<<HTML
<html>
<HEAD>
<style type="text/css">
.zwischenueberschrift_3, .zwischenueberschrift{
    color:red;
    font-weight: bold;
}
</style>
</HEAD>
<body>
<inhalt bez="text" sprache="DE">
    <objekt type="xhtml">
        <div class="text">Amet totam deleniti voluptate corporis wisi, donec alias, aspernatur enim leo, sunt, cursus sollicitudin pellentesque tortor rutrum rerum, magna tenetur aliquid ducimus 
        incidunt sociosqu, mollitia dicta totam gravida dignissi.</div>
        <div class="zwischenueberschrift_3">Nostra lobortis magni, eum dapibus nemo culpa augue ligula class! Magna ullamcorper! Reiciendis porttitor cum vitae. Beatae primis dictum adipisicing 
        deserunt duis posuere magna <br />
        </div>
        <div class="text">rovident viverra? Parturient egestas, condimentum! Mollitia anim? Ullam! Saepe dis, sem? Arcu esse class optio atque! Minima incidunt voluptatem porta eaque animi! Nibh dis, 
        tincidunt aliquet? Perferendis massa eiusmod eius? Aut, nobis! Explicabo penatib:<br />
            <ul>
                <li>reiciendis, laboris facilis bibendum, .<br />
                    <br />
                </li>
                <li>porro, facilis felis? Ridiculus dicta! Integer luctus laoreet rhoncus, habita max. 50 %  corrupti,  3.000 kg  habitant corrupti, . <br />
                    <br />
                </li>
                <li>tur quidem, eos consequat:<br />- Bis 23.09.2016 für Oktober, November, Dezember<br />- Bis 14.10.2016 für November, Dezember, Januar<br />- Bis 11.11.2016 für Dezember, Januar, 
                Februar<br />- Bis 09.12.2016 für Januar, Februar, März</li>
                <br />
                <li>ed imperdiet et phasellus adipiscing! Bibendum. Ad. Maiores pellentesque! Mauri.<br />
                    <br />
                </li>
                <li> Habitant dolore! Vestibulum! Conubia quaerat. Minima, nihil penatibus magna adipisci! Ultricies dignissim hic imperdiet. Tempus, distinctio.<br />
                    <br />
                </li>
                <li>e, quae beatae inceptos labore sunt excepturi id, neque saepe quae tellus. N bei 50 bis 80 % das 0,8-fache (11,2 Cent/kg), bei 20 bis 50% das 0,5-fache (7 Cent/kg) und bei 
                weniger als 20% tus dapibus occa.<br />
                    <br />
                </li>
                <li>la, platea. Reprehenderit   <br />
                    <br />
                </li>
                <li>celerisque convallis occaecat im Jahr 2017.  <br />
                    <br />
                </li>
                <li>Ddolore id? Ea sint? Netus quasi vulputate bis zum 23.09.2016 nutzen.</li>
            </ul>
        </div>
        <div class="zwischenueberschrift">litora magni assumenda! Magnis </div>
        <div class="text">llentesque consectetuer voluptatum purus ratione, temporibus deleniti eveniet ullamco eget nostrud? Sodales fusce. Nostrum culpa saepe quis penatibus accusantium? Sagittis 
        porttitor minima nunc ab fermentum incidunt class urna, tempor, ullamcorper quod beatae? Nostra cubilia felis? Mus pretium fames etiam, cras, velit nec quae, voluptates quas voluptas dis 
        inceptos porro dolorem ligula.t.<br /> <br />entium! Consectetuer tenetur, auctor wisi? Voluptatibus reiciendis unde convallis justo incidunt? Itaque leo? Mollit odio ultricies asperiores 
        ullamco parturient sociosqu reiciendis incidunt consequat. Ut est? Impedit pellentesque fringilla eligendi? Mi ear </div>
        <div class="zwischenueberschrift_3">tpat eros, maiores totam cupi<br />
        </div>
        <div class="text">Excepteur saepe occaecati elit. Ex mauris do porttitor? Convallis molestie, consectetuer culpa. Voluptatum dolor ipsum adipiscing, quia, laudantium mi totam. Beatae quae. 
        Praesent excepturi, nemo fringilla similique quisquam sapiente totam fermentum fuga arcu . </div>
</objekt>       
</inhalt>
…
</body>
</html>        
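HTML;

// Run the sample document through the lexer/parser and print the result.
echo parse($html, $tokens);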

Output

<html>
<HEAD>
<style type="text/css">
.zwischenueberschrift_3, .zwischenueberschrift{
    color:red;
    font-weight: bold;
}
</style>
</HEAD>
<body>
<inhalt bez="text" sprache="DE">
    <objekt type="xhtml">
        <div class="text">Amet totam deleniti voluptate corporis wisi, donec alias, aspernatur enim leo, sunt, cursus sollicitudin pellentesque tortor rutrum rerum, magna tenetur aliquid ducimus 
        incidunt sociosqu, mollitia dicta totam gravida dignissi.</div>
        <span class="zwischenueberschrift_3">Nostra lobortis magni, eum dapibus nemo culpa augue ligula class! Magna ullamcorper! Reiciendis porttitor cum vitae. Beatae primis dictum adipisicing 
        deserunt duis posuere magna <br />
        </span>
        <div class="text">rovident viverra? Parturient egestas, condimentum! Mollitia anim? Ullam! Saepe dis, sem? Arcu esse class optio atque! Minima incidunt voluptatem porta eaque animi! Nibh dis, 
        tincidunt aliquet? Perferendis massa eiusmod eius? Aut, nobis! Explicabo penatib:<br />
            <ul>
                <li>reiciendis, laboris facilis bibendum, .<br />
                    <br />
                </li>
                <li>porro, facilis felis? Ridiculus dicta! Integer luctus laoreet rhoncus, habita max. 50 %  corrupti,  3.000 kg  habitant corrupti, . <br />
                    <br />
                </li>
                <li>tur quidem, eos consequat:<br />- Bis 23.09.2016 für Oktober, November, Dezember<br />- Bis 14.10.2016 für November, Dezember, Januar<br />- Bis 11.11.2016 für Dezember, Januar, 
                Februar<br />- Bis 09.12.2016 für Januar, Februar, März</li>
                <br />
                <li>ed imperdiet et phasellus adipiscing! Bibendum. Ad. Maiores pellentesque! Mauri.<br />
                    <br />
                </li>
                <li> Habitant dolore! Vestibulum! Conubia quaerat. Minima, nihil penatibus magna adipisci! Ultricies dignissim hic imperdiet. Tempus, distinctio.<br />
                    <br />
                </li>
                <li>e, quae beatae inceptos labore sunt excepturi id, neque saepe quae tellus. N bei 50 bis 80 % das 0,8-fache (11,2 Cent/kg), bei 20 bis 50% das 0,5-fache (7 Cent/kg) und bei 
                weniger als 20% tus dapibus occa.<br />
                    <br />
                </li>
                <li>la, platea. Reprehenderit   <br />
                    <br />
                </li>
                <li>celerisque convallis occaecat im Jahr 2017.  <br />
                    <br />
                </li>
                <li>Ddolore id? Ea sint? Netus quasi vulputate bis zum 23.09.2016 nutzen.</li>
            </ul>
        </div>
        <span class="zwischenueberschrift">litora magni assumenda! Magnis </span>
        <div class="text">llentesque consectetuer voluptatum purus ratione, temporibus deleniti eveniet ullamco eget nostrud? Sodales fusce. Nostrum culpa saepe quis penatibus accusantium? Sagittis 
        porttitor minima nunc ab fermentum incidunt class urna, tempor, ullamcorper quod beatae? Nostra cubilia felis? Mus pretium fames etiam, cras, velit nec quae, voluptates quas voluptas dis 
        inceptos porro dolorem ligula.t.<br /> <br />entium! Consectetuer tenetur, auctor wisi? Voluptatibus reiciendis unde convallis justo incidunt? Itaque leo? Mollit odio ultricies asperiores 
        ullamco parturient sociosqu reiciendis incidunt consequat. Ut est? Impedit pellentesque fringilla eligendi? Mi ear </div>
        <span class="zwischenueberschrift_3">tpat eros, maiores totam cupi<br />
        </span>
        <div class="text">Excepteur saepe occaecati elit. Ex mauris do porttitor? Convallis molestie, consectetuer culpa. Voluptatum dolor ipsum adipiscing, quia, laudantium mi totam. Beatae quae. 
        Praesent excepturi, nemo fringilla similique quisquam sapiente totam fermentum fuga arcu . </div>
</objekt>       
</inhalt>
…
</body>
</html>   

Unlike a plain regex, it can handle things like this:

$html = <<<HTML
<div class="zwischenueberschrift_3">
    <div>
        <div>
            <div class="zwischenueberschrift">
                <div>Nostra lobortis </div>
                magni, eum dapibus nemo culpa augue ligula class!
            </div>
        </div>
    </div>
</div>
HTML;

Output

<span class="zwischenueberschrift_3">
    <div>
        <div>
            <span class="zwischenueberschrift">
                <div>Nostra lobortis </div>
                magni, eum dapibus nemo culpa augue ligula class!
            </span>
        </div>
    </div>
</span>

In other words: nesting. Nesting is what kills a pure regex solution. A regex has no good way to track which end tag belongs to which start tag, and the end tags themselves carry no information to go on (they are all </div>).

There are ways to do recursive matching, but they tend to make things really complex and have their own set of issues.
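
To give an idea of what recursive matching looks like, here is a small sketch that compares a lazy pattern with a PCRE pattern using `(?R)`. Both patterns and the sample string are illustrative assumptions, not taken from the question or the answer. The recursive pattern only finds the balanced element; turning that into a replacement that also handles inner matches takes more work.

```php
<?php
$sample = '<div class="zwischenueberschrift"><div>inner</div> tail</div><div>next</div>';

// Lazy matching stops at the first </div> it sees, which belongs to the inner div.
$lazyDiv = '~<div\b[^>]*>.*?</div>~s';
preg_match($lazyDiv, $sample, $lazy);
echo $lazy[0], "\n";
// <div class="zwischenueberschrift"><div>inner</div>

// (?R) recurses into the whole pattern, so every nested <div> is consumed
// together with its own </div> before the outer end tag is matched.
$balancedDiv = '~<div\b[^>]*>(?:(?:(?!</?div\b).)++|(?R))*+</div>~s';
preg_match($balancedDiv, $sample, $balanced);
echo $balanced[0], "\n";
// <div class="zwischenueberschrift"><div>inner</div> tail</div>
```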

What you will find with a plain regex is that the first opening tag gets paired with the first closing tag, which is not correct. Basically this:

<div class="zwischenueberschrift_3"> <!-- this would be a span start tag -->
    <div>
        <div>
            <div class="zwischenueberschrift">
                <div>Nostra lobortis </div> <!-- this would be a span end tag -->

Sorry, but I don't really have the time right now to explain how it works. It's fairly complex and would take a while to explain in full.
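
In rough terms, the code above first splits the input into tokens (div start tags, div end tags, everything else) and then walks the tokens while keeping track of nesting. The sketch below is a simplified variant of that idea, not the exact code from this answer: it uses a stack instead of the counters, and the function name `convertWithStack` and the `T_TEXT` token are assumptions made for illustration.

```php
<?php
// Simplified sketch: a stack remembers, for every open <div>, whether it was
// turned into a <span>, so each end tag can be paired with its start tag.
function convertWithStack(string $html): string
{
    $pattern = '~(?P<T_START_DIV><\s*div\b[^>]*>)'
             . '|(?P<T_END_DIV><\s*/\s*div\s*>)'
             . '|(?P<T_TEXT>[^<]+|<)~i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $result = '';
    $stack = [];
    foreach ($matches as $m) {
        $content = $m[0];
        if (($m['T_START_DIV'] ?? '') !== '') {
            $isTarget = (bool) preg_match('/class="zwischenueberschrift(_\d)?"/', $content);
            $stack[] = $isTarget;
            if ($isTarget) {
                $content = preg_replace('/^<\s*div\b/i', '<span', $content);
            }
        } elseif (($m['T_END_DIV'] ?? '') !== '') {
            // The end tag belongs to the most recently opened, still unclosed div.
            if (array_pop($stack) === true) {
                $content = '</span>';
            }
        }
        $result .= $content;
    }
    return $result;
}

echo convertWithStack(
    '<div class="zwischenueberschrift_3"><div><div class="zwischenueberschrift"><div>a</div>b</div></div></div>'
);
// <span class="zwischenueberschrift_3"><div><span class="zwischenueberschrift"><div>a</div>b</span></div></span>
```

The stack is what a single regex lacks: when an end tag arrives, the top entry says which start tag it closes.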

It wasn't clear from your question what to do with the classes, so I left them alone, but you could change that inside the if here:

case 'T_START_DIV':                   
    if(preg_match('/zwischenueberschrift/i', $content)){
        ++$index;
        if(!isset($nesting[$index])) $nesting[$index] = 0;
        $content = str_replace('div', 'span', $content);
    }                      
    ++$nesting[$index];                   
break;
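
For example, if the `_3` variant should be folded into the plain class name, as the code in the question does, the `str_replace` line inside that if could call a small helper instead. This is only a sketch; `rewriteStartTag` is a hypothetical name, not part of the code above.

```php
<?php
// Hypothetical helper: turns a matched start tag into a span and folds the
// "_3" class variant into the plain class name.
function rewriteStartTag(string $tag): string
{
    // Only touch the tag name at the very start, not "div" inside attributes.
    $tag = preg_replace('/^<\s*div\b/i', '<span', $tag);
    return str_replace('zwischenueberschrift_3', 'zwischenueberschrift', $tag);
}

echo rewriteStartTag('<div class="zwischenueberschrift_3">');
// <span class="zwischenueberschrift">
```

Inside the if, that would read `$content = rewriteStartTag($content);`.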

Enjoy!

ArtisticPhoenix

Thanks for the support. For me, the solution was the one from @Pushpesh Kumar Rajwanshi:

(?s)<div\s+class="zwischenueberschrift(?:_3)?">(.*?)<\/div>
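
A minimal sketch of how this pattern might be applied with `preg_replace`. The extra capture group around the class name and the sample `$text` are assumptions added for illustration. Note that `preg_replace` returns the new string rather than changing `$text` in place, so the result has to be assigned.

```php
<?php
// The class name is captured as $1 (an addition to the pattern above) so the
// replacement can keep it; $2 is the element content.
$pattern = '/(?s)<div\s+class="(zwischenueberschrift(?:_3)?)">(.*?)<\/div>/';

$text = '<div class="text">a</div>' . "\n"
      . '<div class="zwischenueberschrift_3">b<br />' . "\n"
      . '</div>';

// Assign the return value; the input string itself is left untouched.
$result = preg_replace($pattern, '<span class="$1">$2</span>', $text);
echo $result;
```

This works as long as the heading divs never contain another div. With nested divs the lazy `(.*?)` stops at the first `</div>`, as described in the other answer.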
webuser57