You shouldn't use regex for HTML parsing, of course, but if you want to, this should
separate the content from the tags. I have limited knowledge of PHP, so this just illustrates the procedure.

    $tags =
    ' <
      (?:
          /?\w+\s*/?
        | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
        | !(?:DOCTYPE.*?|--.*?--)
      )>
    ';
    $scripts =
    ' <
      (?:
          (?:script|style) \s*
        | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
      )>
      .*?
      </(?:script|style)\s*>
    ';
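    // PHP has no /g flag; preg_replace_callback() already replaces every match.
    // "~" is the delimiter because the patterns themselves contain "/".
    $regex = "~ ($scripts | $tags) | ((?:(?!$tags).)+) ~xs";
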
The replacement string is group 1 concatenated with the return value of your
word-wrap function, which is passed the content (the group 2 string).
So, something like:

    replacement = \1 . textwrap( \2 )

Inside textwrap you decide what to do with the content.
Tested in Perl (by the way, it is very slow and watered down for clarity):
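
    use strict;
    use warnings;

    # Any tag: plain open/close tag, tag with attributes, DOCTYPE, or comment
    my $tags =
    ' <
      (?:
          /?\w+\s*/?
        | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/?
        | !(?:DOCTYPE.*?|--.*?--)
      )>
    ';

    # A whole <script> or <style> element, including its body
    my $scripts =
    ' <
      (?:
          (?:script|style) \s*
        | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*
      )>
      .*?
      </(?:script|style)\s*>
    ';

    # Slurp the HTML from the DATA section (everything after __DATA__, not shown)
    my $html = join '', <DATA>;

    # Group 1 is a tag or script/style element; group 2 is a run of content
    while ( $html =~ / ($scripts | $tags) | ((?:(?!$tags).)+) /xsg ) {
        # Print only content that is not pure whitespace
        if (defined $2 && $2 !~ /^\s+$/) {
            print $2,"\n";
        }
    }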
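
The loop above only prints the content. For the replacement step described earlier (group 1 plus the wrapped group 2), here is a minimal Perl sketch. The names `wrap_html` and `textwrap` are my own placeholders, and the greedy word wrap is just one assumed way to fill in `textwrap`; swap in whatever wrapping you actually need.

```perl
use strict;
use warnings;

# Same patterns as above, written on one line (whitespace is ignored under /x).
my $tags    = ' < (?: /?\w+\s*/? | \w+\s+ (?:".*?"|\'.*?\'|[^>]*?)+\s*/? | !(?:DOCTYPE.*?|--.*?--) ) > ';
my $scripts = ' < (?: (?:script|style) \s* | (?:script|style) \s+ (?:".*?"|\'.*?\'|[^>]*?)+\s* ) > .*? </(?:script|style)\s*> ';

# Hypothetical textwrap: greedy word wrap of one content run at $width columns.
# Leading and trailing whitespace is kept so words next to inline tags stay apart.
sub textwrap {
    my ($text, $width) = @_;
    my ($lead, $body, $trail) = $text =~ /\A(\s*)(.*?)(\s*)\z/s;
    my ($line, @lines) = ('');
    for my $word (split ' ', $body) {
        if (length $line && length($line) + 1 + length($word) > $width) {
            push @lines, $line;      # current line is full, start a new one
            $line = $word;
        } else {
            $line = length $line ? "$line $word" : $word;
        }
    }
    push @lines, $line if length $line;
    return $lead . join("\n", @lines) . $trail;
}

# Hypothetical wrapper: tags and script/style elements (group 1) pass through
# untouched; every content run (group 2) is handed to the $on_text callback.
sub wrap_html {
    my ($html, $on_text) = @_;
    # "$2" passes a copy, so a regex inside the callback cannot clobber it.
    $html =~ s{ ($scripts | $tags) | ((?:(?!$tags).)+) }
              { defined $1 ? $1 : $on_text->("$2") }xsge;
    return $html;
}

print wrap_html('<p>aaa bbb ccc</p>', sub { textwrap($_[0], 7) }), "\n";
```

Note that each content run is wrapped on its own, so the column count restarts after every tag. If you need the line width to carry across inline tags, the callback would have to keep track of the current column itself.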