How to find the first tag in a text using PHP?

Question

I am not good with the preg_match function, but I am trying to use it to find the first body tag.

The tag could be in any of the following formats:

<body class="blah">
<body style="blah: blahblah;">
<body>

I was able to use preg_match() to match the first and second examples, but it does not work on the last one: a simple <body> is not found.

Here is what I have done. $message is the string that I am trying to parse.

$foundBody = preg_match('/<body(.*)>/i', $message, $bodyf);
if ($foundBody != false) {
    $strPos = strpos($message, $bodyf[0]);
    echo $strPos . '<br><br>';
    echo $bodyf[0] . '<br><br>';
    echo strlen($bodyf[0]) . '<br><br>';

    if ($strPos !== false) {
        $message = substr($message, $strPos + strlen($bodyf[0]));
    }
}

NOTE: I am not trying to parse HTML code. All I am trying to do here is parse an email. I basically want to return the text that begins immediately after the <body....> tag and runs to the end of the string.

Do not use regular expressions to parse a DOM tree. Use a tool crafted for that purpose instead, something like SimpleDOM or the like. – arkascha, Mar 16 '15 at 20:35
@arkascha As I mentioned above, I am not trying to parse HTML. I am only trying to find the first occurrence of the tag. – Jaylen, Mar 16 '15 at 20:38
I read that. Still: to find something you have to parse it. No way around that. You can use a crude tool like a regex, or an elegant one. Your choice. – arkascha, Mar 16 '15 at 20:40
Am I missing something? PHP is server side, so it should not need to do this at all. It is there to generate good HTML in the first place. Why are you trying to fix something that should not be broken in the first place? – Ed Heal, Mar 16 '15 at 20:50

score 0 · Answer 1 · answered Mar 16 '15 at 20:38

The following should print the content after the closing > of the <body> tag for all three cases:

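// Find the opening of the first body tag (case-insensitive), then its closing ">".
$i = stripos($message, "<body");
$i = ($i === false) ? false : strpos($message, ">", $i);
// Print everything after the tag; print nothing if no complete tag was found.
echo ($i === false) ? "" : substr($message, $i + 1);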
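
As a minimal sketch, the same idea can be wrapped in a function that reports a missing tag instead of printing from the wrong position. The function name textAfterBodyTag and the sample strings are assumptions made for illustration; they are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the answer): returns the text after the first
// <body ...> tag, or null when no complete tag is found.
function textAfterBodyTag(string $message): ?string
{
    // Case-insensitive search for the start of the tag.
    $start = stripos($message, '<body');
    if ($start === false) {
        return null;
    }
    // First ">" at or after the start of the tag.
    $end = strpos($message, '>', $start);
    if ($end === false) {
        return null;
    }
    return substr($message, $end + 1);
}

// The three tag formats from the question, embedded in made-up messages.
$samples = [
    '<html><body class="blah">Hello</body></html>',
    '<html><body style="blah: blahblah;">Hello</body></html>',
    '<html><body>Hello</body></html>',
];

foreach ($samples as $sample) {
    var_dump(textAfterBodyTag($sample));
}
```

Each of the three sample strings yields Hello</body></html>. Note that this simple search would also accept any other tag whose name starts with "body".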

answered Mar 16 '15 at 20:38

mti2935

Peter Bowers · Answer 2 · 2015-03-16T21:27:31.767

I'm going to throw this solution down here and then run away very quickly before the bullets and hand grenades start flying... (Avoiding regex in connection with HTML has become a bit of a mantra on SO.)

(For the record, I agree that HTML processing should be done by something besides regex. However, playing with regex is fun. If the OP wants to have fun with regex ... why not?)

If you're already using preg_match why don't you just let preg_match do the whole thing for you:

// Modifiers: i = match the tag case-insensitively, s = let "." match newlines,
// so the pattern also works on a message that spans several lines.
if (preg_match('/^(.*?)<body([^>]*)>(.*)$/is', $message, $matches)) {
    echo "Everything before the body tag = <pre>".$matches[1]."</pre><br />";
    echo "Attributes of the body tag = <pre>".$matches[2]."</pre><br />";
    echo "Everything after the body tag = <pre>".$matches[3]."</pre><br />";
} else {
    echo "OOPS! No body tag in that email!<br />\n";
}
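
One possible reason a pattern of this shape fails on a real email is that, in PCRE, "." does not match a newline unless the s modifier is set, so an anchored pattern cannot reach a tag on a later line. The sketch below compares the two variants on a made-up multi-line message; the message and variable names are assumptions for illustration only.

```php
<?php
// Hypothetical multi-line message: the body tag is not on the first line.
$message = "<html>\n<body class=\"blah\">\nHello\nWorld\n</body>\n</html>";

// Without the s modifier, "." stops at the first newline, so the anchored
// pattern cannot get from the start of the string to the <body> tag.
$withoutDotAll = preg_match('/^(.*?)<body([^>]*)>(.*)$/i', $message, $m1);

// With the s modifier, "." also matches newlines and the pattern succeeds.
$withDotAll = preg_match('/^(.*?)<body([^>]*)>(.*)$/is', $message, $m2);

var_dump($withoutDotAll); // int(0)
var_dump($withDotAll);    // int(1)

if ($withDotAll === 1) {
    // $m2[2] holds the tag attributes, $m2[3] everything after the tag.
    var_dump($m2[2]);
    var_dump($m2[3]);
}
```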

edited Mar 16 '15 at 21:27

answered Mar 16 '15 at 20:48

Peter Bowers

*cocks really big gun* – rjdown Mar 16 '15 at 20:51
Somewhere on SO there is a really cute answer re regex & html - makes it look like the guy is going crazy... Anybody know where it is? – Peter Bowers Mar 16 '15 at 20:53
Shouldn't you use non-greedy matches like (.*?) instead of (.*)? – Anders Lindén Mar 16 '15 at 20:55
@PeterBowers This code is not working. It does not find the body. – Jaylen Mar 16 '15 at 21:10
Really? Is it not matching or not printing what you want after it matches? – Peter Bowers Mar 16 '15 at 21:33

score 0 · Answer 3 · answered Mar 16 '15 at 21:28

I figured out a way to do it without regex: I used the tidy class.

$tidy = new tidy();
$message = $tidy->repairString($message, array( 'output-html' => true, 'show-body-only' => true ), 'utf8');

For this to work, the tidy extension must be enabled in the PHP configuration file (php.ini).
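
Since the approach depends on the extension being available, a guarded version might look like the sketch below. The function name bodyOnlyWithTidy and the sample input are assumptions for illustration; the sketch uses the procedural tidy_repair_string() with the same options as above and returns null when tidy is not loaded.

```php
<?php
// Hypothetical helper: returns the repaired body markup, or null when the
// tidy extension is not available or the repair fails.
function bodyOnlyWithTidy(string $message): ?string
{
    if (!extension_loaded('tidy')) {
        return null;
    }
    $result = tidy_repair_string(
        $message,
        ['output-html' => true, 'show-body-only' => true],
        'utf8'
    );
    return is_string($result) ? $result : null;
}

// Made-up input for illustration.
$body = bodyOnlyWithTidy('<html><body class="blah"><p>Hello</p></body></html>');
var_dump($body);
```

Unlike the string-search approaches, tidy also rewrites the markup it returns, so the result is not a byte-for-byte slice of the original message.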
