How can I distinguish `<div>` from `<?php ?>` using a PHP regex? `div` is just an example: in `<smth> <?php code ?> </smth>` I need to distinguish between `smth` and `?php code ?`, where `smth` and `code` may be any combination of characters.

I would like to get the content of `<div>` but not of `<?php ?>`, and sometimes vice versa.

$regex1 = '#<(?<!\?)(.*?)>#' ;  // result : div , ?php ? , /div  . 

In this case I do not want the `?php ?` match.

$regex1 = '#<(?<!\??)(.*?)>#' ;  //Compilation failed: lookbehind assertion is not fixed length at offset 8
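For reference, here is a minimal sketch of a lookahead variant, assuming the tags contain no literal `>` other than the one in `?>`. In the first pattern above the lookbehind `(?<!\?)` sits right after `<`, so the character it inspects is the `<` itself and it never rejects anything; a lookahead `(?!\?)` tests the character that follows instead. The variable names `$tags` and `$php` are made up for this illustration.

```php
<?php
$htmlStr = ' before <div> inside <?php ?> </div> after ';

// Tags only: "<" must not be followed by "?", and the closing ">"
// must not be preceded by "?".
preg_match_all('#<(?!\?)(.*?)(?<!\?)>#', $htmlStr, $tags);

// The opposite case: only the PHP blocks.
preg_match_all('#<\?php(.*?)\?>#s', $htmlStr, $php);

print_r($tags[1]); // contents of the ordinary tags
print_r($php[1]);  // contents of the PHP blocks
```

Both lookarounds have a fixed length of one character, so the "lookbehind assertion is not fixed length" error does not come up.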

My second question is how to retrieve the `<?php ?>` that sits inside a `<div>` or similar HTML markup.

$htmlStr = " before <div> inside <?php ?> </div> after ";
$regex1 = '#(.*)'        // before
        . '<(?!\?)'      // < not followed by ?
        . '(.*)'         // div
        . '((?<!\?)>)'   // > not preceded by ?
        . '(.*)'         // retrieves only 'inside' instead of 'inside <?php ? >'
        . '</'           // </
        . '.*'           // div
        . '((?<!\?)>)'   // > not preceded by ?
        . '(.*)#';       // after

I have also tried a non-greedy expression:

$regex1 = '#(.*)'        // before
        . '<(?!\?)'      // < not followed by ?
        . '(.*)'         // div
        . '((?<!\?)>)'   // > not preceded by ?
        . '(.*?)'        // non-greedy; fetches only 'inside', but I need 'inside <?php ? >'
        . '</'           // </
        . '.*'           // div
        . '((?<!\?)>)'   // > not preceded by ?
        . '(.*)#';       // after

Final adjustment: DOM also does not retrieve the value if it contains `<?php ?>`. In the code below, the line `foreach ($els as $el) { echo '<br><br>element value ='. $el->nodeValue; }` echoes `inside` instead of `inside <?php ?>`.

$htmlStr = " before <div> inside <?php ?> </div> after ";
$regex1 = '#(.*)<([a-zA-Z]+)>' // before, then an opening tag made of letters only
        . '(.*)'          // greedy; seems to fetch only 'inside', but I need 'inside <?php ? >'
        . '</'            // </
        . '[a-zA-Z]+'     // div
        . '(?<!\?)>'      // > not preceded by ?
        . '(.*)#';        // after
preg_match_all($regex1, $htmlStr, $attrArr1, 0); //input
$attrArr1 = array_filter($attrArr1);
print_r('<br><br> 619 htmlStr=' . $htmlStr. ',   attrArr1 = <pre>'); print_r($attrArr1); 

$dom = new \DOMDocument('1.0'); // name, value 
$dom->loadHTML($htmlStr); 
$ansArr['elType'] = $attrArr1[2][0];
//$els = $dom->getElementsByTagName('*'); // To be done 
$els = $dom->getElementsByTagName($ansArr['elType'] );
foreach ($els as $el) { echo '<br><br>element value ='. $el->nodeValue; } //gives value 'inside'
print_r('<br><br>620 elType='.$ansArr['elType'].',   els='); print_r($els); 
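A rough sketch of why `nodeValue` comes back without the PHP block, and one possible way around it. The parser does not store `<?php ?>` as text, so `nodeValue` / `textContent` skip it; how exactly it is stored may depend on the libxml version, which I have not checked across releases. Serializing the element with `DOMDocument::saveHTML($node)` returns its markup rather than only its text. The variable names `$textOnly` and `$markup` are just for illustration.

```php
<?php
$htmlStr = ' before <div> inside <?php ?> </div> after ';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($htmlStr);

$div = $dom->getElementsByTagName('div')->item(0);

$textOnly = $div->textContent;    // text nodes only, the PHP block is skipped
$markup   = $dom->saveHTML($div); // serialized element, child markup included

// Escape before echoing, otherwise the browser hides the tags again.
echo htmlspecialchars($markup), PHP_EOL;
```

The serialized form may not be byte-for-byte identical to the input, so treat it as a way to inspect the node rather than as a round trip.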
olga
  • Use a parser instead. The xpath query will be sth. like `$xpath->query("//div");`. – Jan Feb 21 '17 at 13:01
  • [Here be dragons](http://stackoverflow.com/a/1732454/354577). While it may be possible to handle specific narrow use cases with regular expressions, in general it is **_literally not possible_** to parse HTML with regex. It's almost always better to use a proper XML / HTML parser like [`DOMDocument`](https://secure.php.net/manual/en/class.domdocument.php) or an XML query language like [XPath](https://en.wikipedia.org/wiki/XPath) as suggested by Jan. – ChrisGPT was on strike Feb 21 '17 at 13:05
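The parser route suggested in these comments could look roughly like the sketch below. It assumes the markup is loaded with `DOMDocument::loadHTML` and queried with `DOMXPath`; the query `//body//*` is my own choice for the case where the tag name is not known in advance, and `$names` is a hypothetical variable.

```php
<?php
$htmlStr = ' before <div> inside <?php ?> </div> after ';

libxml_use_internal_errors(true); // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($htmlStr);
$xpath = new DOMXPath($dom);

// Select every element under <body>, whatever its name is.
$names = [];
foreach ($xpath->query('//body//*') as $el) {
    $names[] = $el->nodeName;
    echo $el->nodeName, ': ', trim($el->textContent), PHP_EOL;
}
```

The parser may add wrapper elements of its own around loose text, so the list can contain more than just `div`.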

2 Answers

1

Answer to your first question:

("get content of `<smth>` but not of `<?php ?>`")

input  >> <smth id="test">Hello World!</smth><?php echo "Hello World!"?>
regex  >> (?<=<smth\s)(.*?)(?=>)
output >> id="test"

Answer to your second question:

("retrieve inside `<?php ?>`")

input  >> <smth id="test">Hello World!</smth><?php echo "Hello World!"?>
regex  >> (?<=<\?php\s)(.*?)(?=\?>)
output >> echo "Hello World!"

("retrieve inside `<smth>` and similar html markup")

input  >> <smth id="test">Hello World!</smth><?php echo "Hello World!"?>
regex  >> (?<=<smth[\s]id="test">).*?(?=<\/smth>)  // not efficient
output >> Hello World!

Hope this helps!

m87
  • I am sorry, `div` is just an example. I need – olga Feb 21 '17 at 16:29
  • @olga check updated answer. now it should work for `` – m87 Feb 21 '17 at 18:03
  • I mean I do not know in advance whether it will be `smth` or `div` or `span` or `input`, etc. `smth` shall not be part of the regex. Thank you for the answer; it makes a good point. Sorry for the unclear question. – olga Feb 21 '17 at 19:07
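When the tag name is not known in advance, one possible approach is to capture the name and reuse it through a backreference for the closing tag. The sketch below is only an illustration under narrow assumptions: tag names are plain ASCII, and elements with the same name are not nested inside each other. The sample string and the variable `$m` are made up for this example.

```php
<?php
$htmlStr = '<smth id="test">Hello <?php echo "x" ?> World!</smth>';

$pattern = '#<([a-zA-Z][a-zA-Z0-9]*)\b' // group 1: tag name, whatever it is
         . '(.*?)(?<!\?)>'              // group 2: attributes, up to a ">" not preceded by "?"
         . '(.*?)'                      // group 3: content, PHP blocks included
         . '</\1\s*>#s';                // closing tag with the same name as group 1

preg_match($pattern, $htmlStr, $m);

echo htmlspecialchars($m[1]), PHP_EOL; // tag name
echo htmlspecialchars($m[2]), PHP_EOL; // attribute string
echo htmlspecialchars($m[3]), PHP_EOL; // content
```

For real-world HTML with nesting, the parser-based suggestions from the comments remain the safer route.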
1

It seems the regex works; it just does not show the `<?php ?>`, because the browser does not render it as visible text. Nevertheless, if you add some text after it, you will get everything. Also, if you write the result to a file, you will find not only the word `inside` but also `inside <?php ?> smth`.
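To see the captured value in the browser without writing it to a file, the output can be escaped first. This is a small sketch using `htmlspecialchars`; the shortened pattern and the variables `$m`, `$inside` and `$escaped` are only for illustration.

```php
<?php
$htmlStr = ' before <div> inside <?php ?> </div> after ';

preg_match('#<([a-zA-Z]+)>(.*)</[a-zA-Z]+(?<!\?)>#', $htmlStr, $m);
$inside = $m[2]; // the content between the opening and closing tag

// Escaping turns "<" and ">" into entities, so the browser shows them as text.
$escaped = htmlspecialchars($inside);
echo '<pre>', $escaped, '</pre>';
```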

   //$regex2 ='#(.*)<([a-zA-Z]+)(.*?)(?<!\?)>(.*)</[a-zA-Z]+(?<!\?)>(.*)#'; for textarea, also div/span/p/a and other elements having closing tags.

//$regex3 ='#(.*)<([a-zA-Z]+)(.*?)/>(.*)#'; //for input

// Textarea
$htmlStr2 = " before <textarea name='<?php  ?>' > inside '<?php ? >' smth. </textarea> after ";
//$regex2 ='#(.*)<([a-zA-Z]+)(.*?)(?<!\?)>(.*)</[a-zA-Z]+(?<!\?)>(.*)#'; for textarea
$regex2 = '#(.*)<([a-zA-Z]+)' // before, then the tag name (textarea)
        . '(.*?)'        // attributes
        . '(?<!\?)>'     // > not preceded by ?
        . '(.*)'         // content, including <?php ? >
        . '</'           // </
        . '[a-zA-Z]+'    // closing tag name
        . '(?<!\?)>'     // > not preceded by ?
        . '(.*)#';       // after
preg_match_all($regex2, $htmlStr2, $attrArr, 0);
$attrArr = array_filter($attrArr); // array_filter removes empty values
print_r('<br><br> 619 htmlStr=' . $htmlStr2 . ',   attrArr = <pre>'); print_r($attrArr);
$resStr = print_r($attrArr, true);
print_r('<br><br> resStr=' . $resStr);
file_put_contents('C:\\Users\\gintare\\Documents\\reg2.txt', $resStr);
$before = $attrArr[1][0];     // before
$tagType = $attrArr[2][0];    // textarea
$tagAttrStr = $attrArr[3][0]; // name='<?php  ? >'
$inside = $attrArr[4][0];     // inside '<?php ? >' smth.
$after = $attrArr[5][0];      // after

// Input
$htmlStr3 = " before <input class='inp' > after '<?php ? >' smth. ";
//$regex3 ='#(.*)<([a-zA-Z]+)(.*?)/>(.*)#'; //for input
$regex3 = '#(.*)<([a-zA-Z]+)' // before, then the tag name (input)
        . '(.*?)'        // attributes
        . '/?(?<!\?)>'   // optional /, then > not preceded by ? (the sample tag has no /)
        . '(.*)#';       // after
preg_match_all($regex3, $htmlStr3, $attrArr, 0);
$attrArr = array_filter($attrArr); // array_filter removes empty values
print_r('<br><br> 619 htmlStr=' . $htmlStr3 . ',   attrArr = <pre>'); print_r($attrArr);
$resStr = print_r($attrArr, true);
print_r('<br><br> resStr=' . $resStr);
file_put_contents('C:\\Users\\gintare\\Documents\\reg3.txt', $resStr);
$before = $attrArr[1][0];     // before
$tagType = $attrArr[2][0];    // input
$tagAttrStr = $attrArr[3][0]; // class='inp'
$after = $attrArr[4][0];      // after '<?php ? >' smth.
olga