
I am not very good with the preg_match function, but I am trying to use it to find the first body tag.

The tag could be in any of the following formats:

<body class="blah">
<body style="blah: blahblah;">
<body>

I was able to use preg_match() to match the first and second examples, but it does not work on the last one: a simple <body> is not found.

Here is what I have done. $message is the string that I am trying to parse

$foundBody = preg_match('/<body(.*)>/i', $message, $bodyf);
if ($foundBody != false) {
    $strPos = strpos($message, $bodyf[0]);
    echo $strPos . '<br><br>';
    echo $bodyf[0] . '<br><br>';
    echo strlen($bodyf[0]) . '<br><br>';

    if ($strPos !== false) {
        $message = substr($message, $strPos + strlen($bodyf[0]));
    }
}

NOTE: I am not trying to parse HTML code. All I am trying to do here is parse an email. I basically want to return the text that begins immediately after the <body....> tag and runs to the end of the string.

Jaylen
  • Do not use regular expressions to parse a DOM tree. Use a tool crafted for that purpose instead, something like SimpleDOM or the like. – arkascha Mar 16 '15 at 20:35
  • @arkascha As I mentioned above, I am not trying to parse HTML. I am only trying to find the first occurrence of the tag. – Jaylen Mar 16 '15 at 20:38
  • I read that. Still: to find something you have to parse it. No way around that. You can use a crude tool like a regex, or an elegant one. Your choice. – arkascha Mar 16 '15 at 20:40
  • Am I missing something? PHP is server side, so it should not need to do this at all. It is there to generate good HTML in the first place. Why are you trying to fix something that should not be broken in the first place? – Ed Heal Mar 16 '15 at 20:50
  • He's parsing an HTML email. – Peter Bowers Mar 16 '15 at 20:54

3 Answers


The following should print the content after the closing > of the <body> tag for all three cases:

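$i = stripos($message, "<body");       // case-insensitive, so <BODY> is found too
if ($i !== false && ($j = strpos($message, ">", $i)) !== false) {
    echo substr($message, $j + 1);     // everything after the tag's closing ">"
}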
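
The same string-search idea can be wrapped in a small function that reports a missing tag explicitly instead of printing from the wrong position. This is only a minimal sketch: the helper name textAfterBodyTag is hypothetical, and it assumes the first ">" after "<body" really closes the tag (a ">" inside an attribute value would break it).

```php
<?php
// Hypothetical helper (not from the original answer): returns the text that
// follows the first <body ...> tag, or null when no such tag is found.
function textAfterBodyTag(string $message): ?string
{
    // stripos() is case-insensitive, so <BODY> and <Body> are found too.
    $start = stripos($message, '<body');
    if ($start === false) {
        return null;
    }

    // Find the ">" that closes the opening tag, searching from the tag start.
    $end = strpos($message, '>', $start);
    if ($end === false) {
        return null;
    }

    return substr($message, $end + 1);
}

// The three formats from the question.
$samples = [
    '<body class="blah">Hi',
    '<body style="blah: blahblah;">Hi',
    '<body>Hi',
];
foreach ($samples as $sample) {
    var_dump(textAfterBodyTag($sample));
}
```

Returning null rather than an empty string lets the caller tell "no body tag" apart from "body tag followed by nothing".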
mti2935

I'm going to throw this solution down here and then run away very quickly before the bullets and hand grenades start flying... (Avoiding regex in connection with HTML has become a bit of a mantra on SO.)

(For the record, I agree that HTML processing should be done by something besides regex. However, playing with regex is fun. If the OP wants to have fun with regex ... why not?)

If you're already using preg_match why don't you just let preg_match do the whole thing for you:

// "s" lets . match newlines (emails are multi-line); "i" ignores case;
// \b keeps "<body" from matching a longer tag name.
if (preg_match('/^(.*?)<body\b([^>]*)>(.*)$/is', $message, $matches)) {
    echo "Everything before the body tag = <pre>".$matches[1]."</pre><br />";
    echo "Attributes of the body tag = <pre>".$matches[2]."</pre><br />";
    echo "Everything after the body tag = <pre>".$matches[3]."</pre><br />";
} else {
    echo "OOPS! No body tag in that email!<br />\n";
}
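
If only the text after the tag is needed, preg_match can also report where the match starts through the PREG_OFFSET_CAPTURE flag, so there is no need to capture the whole message or run a second strpos() search. A rough sketch, assuming a hypothetical helper name stripThroughBodyTag and assuming the message should be returned unchanged when no tag is found:

```php
<?php
// Hypothetical helper for illustration; the name is not from the thread.
function stripThroughBodyTag(string $message): string
{
    // \b stops "<body" from matching a longer tag name, and [^>]* happily
    // spans newlines, so no "s" modifier is needed for this pattern.
    if (preg_match('/<body\b[^>]*>/i', $message, $m, PREG_OFFSET_CAPTURE)) {
        // With PREG_OFFSET_CAPTURE each match is [matched text, byte offset].
        [$tag, $offset] = $m[0];
        return substr($message, $offset + strlen($tag));
    }

    // No body tag: leave the message as it was.
    return $message;
}

echo stripThroughBodyTag('<html><body class="blah">Hello</body></html>');
```

Because preg_match stops at the first match, this always cuts at the first body tag, which is what the question asks for.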
Peter Bowers

I figured out a way to do it without using a regex. I used the tidy class:

$tidy = new tidy();
$message = $tidy->repairString($message, array( 'output-html' => true, 'show-body-only' => true ), 'utf8');

For this to work, the tidy extension must be enabled in the PHP configuration file.
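
Since tidy is an optional extension, it may be worth checking for it at runtime. The sketch below is one possible arrangement, not a tested recipe: the function name bodyContent and the string-search fallback are assumptions added for illustration. Note that the two branches do not produce identical output, because tidy repairs and reformats the markup while the fallback only cuts the string.

```php
<?php
// Hypothetical wrapper: use tidy when the extension is loaded, otherwise
// fall back to a plain string search for the first <body ...> tag.
function bodyContent(string $message): string
{
    if (extension_loaded('tidy')) {
        $tidy = new tidy();
        $config = ['output-html' => true, 'show-body-only' => true];
        $repaired = $tidy->repairString($message, $config, 'utf8');
        if (is_string($repaired)) {
            return $repaired;
        }
    }

    // Fallback: cut everything up to and including the first body tag.
    $start = stripos($message, '<body');
    $end = ($start === false) ? false : strpos($message, '>', $start);

    return ($end === false) ? $message : substr($message, $end + 1);
}

echo bodyContent('<html><body class="blah"><p>Hello</p></body></html>');
```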

Jaylen