preg_split but ignore XML and HTML entities

Question

I am using this PHP code to split a string roughly every 120 characters, at the closest space. But it also splits HTML and XML entities, so it sometimes outputs things like id="id">. How can I make it ignore XML and HTML entities without removing them?

function splitWords($string, $max = 1)
{
    $words = preg_split( '/\s/', $string );
    $lines = array();
    $line = '';

    foreach ( $words as $k => $word ) {
        $newLine = $line . ' ' . $word;
        $length = strlen( $newLine );
        if ( $length <= $max ) {
            $line .= ' ' . $word;
        } else if ( $length > $max ) {
            if ( !empty( $line ) ) {
                $lines[] = trim( $line );
            }
            $line = $word;
        } else {
            $lines[] = trim( $line ) . ' ' . $word;
            $line = '';
        }
    }
    $lines[] = ( $line = trim( $line ) ) ? $line : $word;

    return $lines;
}

Maybe you could use [DOMDocument](http://www.php.net/manual/en/domdocument.loadhtml.php) and iterate through it? — Wiktor, Aug 20 '13 at 14:10
What for? Because if it's for e-mails, [that's what `quoted_printable_encode()` is for](http://php.net/manual/en/function.quoted-printable-encode.php) — Wrikken, Aug 21 '13 at 14:57
(If not for e-mail, [`XMLReader::readString()`](http://www.php.net/manual/en/xmlreader.readstring.php) is a good starting point, _if_ it's supported in your version). — Wrikken, Aug 21 '13 at 15:10

score 1 · Answer 1 · edited May 23 '14 at 23:45

Description

I would change your split command to use either a whole tag or a space as the delimiter.

This basic regex will:

- match whole tags or single spaces
- not match spaces inside tags
- avoid many of the pitfalls of pattern matching HTML text

<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|\s

[Image: flowchart visualization of the regular expression above]

With this regex you can do all sorts of crazy things depending on where you place the capturing parentheses and which options you pass to preg_split.
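As a small illustration of that point, here is a sketch on a short made-up string. It uses a simplified tag pattern (`<[^>]+>`, an assumption made for brevity, not the full expression above) to show how capturing the tag alternative, combined with PREG_SPLIT_DELIM_CAPTURE, decides whether tags survive in the result.

```php
<?php
// Illustration only: <[^>]+> is a simplified stand-in for the full tag pattern.
$sample = 'Hello <b class="x y">big world</b> again';

// Tag alternative is captured: tags come back as tokens, spaces are dropped.
$withTags = preg_split(
    '/(<[^>]+>)|\s/',
    $sample,
    -1,
    PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);

// Nothing is captured: both tags and spaces are dropped.
$textOnly = preg_split('/<[^>]+>|\s/', $sample, -1, PREG_SPLIT_NO_EMPTY);

echo implode(',', $withTags), "\n";
echo implode(',', $textOnly), "\n";
```

In both calls the space inside `class="x y"` is never treated as a delimiter, because the tag alternative consumes the whole tag before `\s` gets a chance to match there.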

Examples

Live Demo

Note that in this demo the anchor tags have some seriously difficult edge cases.

PHP v5.4.4 Code

<?php

$string = ' <a onmouseover=\' <a href="notreal.com">This is text inside an attribute</a> \' href=url.com>This is some inner text</a>This is outer text.

    <a onmouseover=\' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; \'  href=\'http://InterestedURL.com\' id=\'revSAR\'>
        I am the inner text too.
        </a>
';

echo "split retains all spaces\n";
$array = preg_split ('/(<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|\s)/', $string, 0, PREG_SPLIT_DELIM_CAPTURE); 
echo implode(",",$array);

echo "\n\nsplit ignores spaces\n";
$array = preg_split ('/(<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>)|\s/', $string, 0, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY); 
echo implode(",",$array);

echo "\n\nsplit ignores tags and spaces\n";
$array = preg_split ('/<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|\s/', $string, 0,  PREG_SPLIT_NO_EMPTY); 
echo implode(",",$array);

echo "\n\nsplit ignores tags and retains spaces\n";
$array = preg_split ('/<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>|(\s)/', $string, 0,  PREG_SPLIT_DELIM_CAPTURE); 
echo implode(",",$array);

Output

You're probably most interested in the third option, "split ignores tags and spaces".

split retains all spaces
,   ,,<a onmouseover=' <a href="notreal.com">This is text inside an attribute</a> ' href=url.com>,This, ,is, ,some, ,inner, ,text,</a>,This, ,is, ,outer, ,text.,
,,
,,  ,,<a onmouseover=' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; '  href='http://InterestedURL.com' id='revSAR'>,,
,,  ,,  ,I, ,am, ,the, ,inner, ,text, ,too.,
,,  ,,  ,,</a>,,
,

split ignores spaces
<a onmouseover=' <a href="notreal.com">This is text inside an attribute</a> ' href=url.com>,This,is,some,inner,text,</a>,This,is,outer,text.,<a onmouseover=' a=1; href="www.NotYourURL.com" ; if (3 <a && href="www.NotYourURL.com" && id="revSAR" && 6 > 3) { funRotate(href) ; } ; '  href='http://InterestedURL.com' id='revSAR'>,I,am,the,inner,text,too.,</a>

split ignores tags and spaces
This,is,some,inner,text,This,is,outer,text.,I,am,the,inner,text,too.

split ignores tags and retains spaces
,   ,,This, ,is, ,some, ,inner, ,text,This, ,is, ,outer, ,text.,
,,
,,  ,,,
,,  ,,  ,I, ,am, ,the, ,inner, ,text, ,too.,
,,  ,,  ,,,
,
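To tie this back to the original goal of breaking text into lines without cutting a tag in half, here is a rough sketch rather than a definitive implementation. The function name `splitWordsKeepingTags` and the sample string are hypothetical. It assumes whitespace between words may be normalized to a single space, that tag characters count toward the line length, and it changes `\s` to `\s+` so a run of whitespace becomes one token.

```php
<?php
// Hypothetical helper: wraps text at whitespace, never inside a tag.
function splitWordsKeepingTags($string, $max = 120)
{
    // Tag pattern taken from the regex above.
    $tag = '<\/?\w+(?=\s|>)(?:[^>=|&)]*|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*>';

    // Capture tags and whitespace runs so every token is returned.
    $tokens = preg_split(
        '/(' . $tag . '|\s+)/',
        $string,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    // Glue words and tags that touch each other into unbreakable chunks.
    $chunks = array();
    $current = '';
    foreach ($tokens as $token) {
        if (trim($token) === '') {
            // A whitespace run ends the current chunk.
            if ($current !== '') {
                $chunks[] = $current;
                $current = '';
            }
        } else {
            $current .= $token;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }

    // Greedy wrap: start a new line when adding a chunk would exceed $max.
    $lines = array();
    $line = '';
    foreach ($chunks as $chunk) {
        $candidate = ($line === '') ? $chunk : $line . ' ' . $chunk;
        if ($line !== '' && strlen($candidate) > $max) {
            $lines[] = $line;
            $line = $chunk;
        } else {
            $line = $candidate;
        }
    }
    if ($line !== '') {
        $lines[] = $line;
    }

    return $lines;
}

// Made-up sample input; a small $max forces several breaks.
$html = 'Some text <a href="http://example.com" id="id">a link</a> and more text after it.';
$lines = splitWordsKeepingTags($html, 30);
print_r($lines);
```

A chunk longer than `$max`, such as a long opening tag, is placed on a line of its own instead of being cut. Whether markup should count toward the limit at all is a design choice; this sketch counts it because that keeps the loop simplest.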

@Cole"Cole9"Johnson did you downvote and for that reason? Do you have any specific test case where this fails? — John Dvorak, Aug 21 '13 at 15:00
@JimDvorak yes. I did downvote because of that. [Here's why](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg). Seriously. The answerer has 5k rep and is suggesting to parse HTML with regex! — Cole Tobin, Aug 21 '13 at 15:03
@Cole"Cole9"Johnson, so it works, but you just don't agree with the process. Fair enough, everyone has an opinion. — Ro Yo Mi, Aug 21 '13 at 18:37
Did you even read the linked question? CLEARLY, there are cases where regex+HTML _will_ fail. If there are cases where it can fail, it doesn't work. **"Oh, my computer says there's an update to fix a security hole. Oh that's not something I need; my computer works just fine. (1 month later) My computer won't boot and I don't know why!"** That's how you are viewing this. It may work now, but it _will_ fail. Seriously, if you want to parse HTML, the W3C defined a **DOM Parser** that nearly every language has an implementation of. **Just use _that_.** — Cole Tobin, Aug 21 '13 at 22:03
@Denomales However, I would like to know what you used to make that _awesome_ flowchart. — Cole Tobin, Aug 21 '13 at 22:04
@Cole"Cole9"Johnson ... **you don't** *need* to **bold half of your entire comment**; there's no **point in bolding everything** because then **nothing is** emphasized **anymore** — tckmn, Aug 22 '13 at 02:51
@Cole"Cole9"Johnson, I don't disagree with your logic; however, the OP was already using regex to split HTML. Regarding your link: my solution works with multi-line files and with embedded `<` or `>` strings inside attributes. The nested-tags concern doesn't apply to this request because the requester just wanted to "skip" HTML tags, and comments weren't raised as a concern by the requester. — Ro Yo Mi, Aug 22 '13 at 05:13
@Cole"Cole9"Johnson, parsing html with regex is wrong. This question is about pattern matching. I would like to point out that as of this writing mine is still the only offered answer to a 24 hour old question. You first commented on my answer 14 hours ago. With your religious devotion to your position perhaps you should also offer a proposed solution. — Ro Yo Mi, Aug 22 '13 at 05:23
@Cole"Cole9"Johnson, regarding the chart, that was generated on http://www.debuggex.com/. They have an amazing tool for visualizing regular expressions (including PCRE). They are in beta. I'm not affiliated with them in any way other than providing feedback on the tool. — Ro Yo Mi, Aug 22 '13 at 05:25
