3

I'm looking for a way to split a string containing HTML into two halves. Requirements:

  • Split a string by a number of chars
  • Must not split in the middle of a word
  • Must not count HTML tag characters when calculating where to split the string

For example take the following string:

<p>This is a test string that contains <strong>HTML</strong> tags and text content. This string needs to be split without slicing through the <em>middle</em> of a word and must preserve the validity of the HTML, i.e. not split in the middle of a tag, and make sure closing tags are respected correctly.</p>

Say I want to split at char position 39, which falls in the middle of the word HTML (not counting the HTML tags). I would want the function to split the string into the following two parts:

<p>This is a test string that contains <strong>HTML</strong></p>

and

<p>tags and text content. This string needs to be split without slicing through the <em>middle</em> of a word and must preserve the validity of the HTML, i.e. not split in the middle of a tag, and make sure closing tags are respected correctly.</p>

Notice that in the two example results above the validity of the HTML is respected, so the closing </strong> and </p> tags were added to the first half. A starting <p> tag was also added to the second half, because one is closed at the end of the string.

I found this function on StackOverflow to truncate a string by a number of text chars and preserve HTML, but it only goes halfway to what I need, as I need to split the string into two halves.

function printTruncated($maxLength, $html)
{
    $printedLength = 0;
    $position = 0;
    $tags = array();

    while ($printedLength < $maxLength && preg_match('{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}', $html, $match, PREG_OFFSET_CAPTURE, $position))
    {
        list($tag, $tagPosition) = $match[0];

        // Print text leading up to the tag.
        $str = substr($html, $position, $tagPosition - $position);
        if ($printedLength + strlen($str) > $maxLength)
        {
            print(substr($str, 0, $maxLength - $printedLength));
            $printedLength = $maxLength;
            break;
        }

        print($str);
        $printedLength += strlen($str);

        if ($tag[0] == '&')
        {
            // Handle the entity.
            print($tag);
            $printedLength++;
        }
        else
        {
            // Handle the tag.
            $tagName = $match[1][0];
            if ($tag[1] == '/')
            {
                // This is a closing tag.

                $openingTag = array_pop($tags);
                assert($openingTag == $tagName); // check that tags are properly nested.

                print($tag);
            }
            else if ($tag[strlen($tag) - 2] == '/')
            {
                // Self-closing tag.
                print($tag);
            }
            else
            {
                // Opening tag.
                print($tag);
                $tags[] = $tagName;
            }
        }

        // Continue after the tag.
        $position = $tagPosition + strlen($tag);
    }

    // Print any remaining text.
    if ($printedLength < $maxLength && $position < strlen($html))
        print(substr($html, $position, $maxLength - $printedLength));

    // Close any open tags.
    while (!empty($tags))
        printf('</%s>', array_pop($tags));
}
Camsoft
  • The answer given [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) applies to your question as well – Philar Nov 28 '10 at 13:14

1 Answer

4

The general rule you'll be quoted by almost all other answers is "do not process HTML with regex - you can't capture all the edge cases".

I believe this to be quite true.

If anything in your string is even slightly malformed, even the best-crafted regular expression will still mess it up.

Setting aside that you want to split some tags and not others (p tags are tags, after all, and you're looking to split one into two), you may need to rethink the process and get very specific about what you want to achieve. For example: is splitting in the middle of a paragraph tag okay? What about divs? If the middle point falls inside a tag, do you want the first string to be longer, or the second?

Assuming that splitting paragraph tags is okay, but splitting others isn't, try an approach as follows (no copy-paste code here, sorry):

  • Strip the target string twice: once of all tags, and once of just paragraph tags
  • Find the middle point in the no-tags-at-all string
  • Split the no-tags-at-all string at the first space after the middle point
  • Find the spot in the just-p-tags-stripped string that matches the word/words just after the middle point in the previous step. This should tell you where 'the middle' is in the just-p-tags-stripped string when tags are ignored
  • Check to see if you're inside a tag.
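As a rough illustration only, the first three steps (strip all tags, find the middle, split at the first space after it) might look like the sketch below. The function name splitPlainTextAtMiddle is hypothetical, the code counts bytes rather than characters, and strip_tags() leaves entities such as &amp; undecoded, so treat it as a starting point under those assumptions.

```php
<?php
// Hypothetical helper (not from the question): strips all tags, then splits
// the remaining plain text at the first space at or after its midpoint.
// Assumes single-byte text; strlen() and strpos() work on bytes.
function splitPlainTextAtMiddle(string $html): array
{
    $text = strip_tags($html);               // the no-tags-at-all string
    $middle = intdiv(strlen($text), 2);      // middle point, in bytes

    // First space at or after the middle; fall back to the end of the text.
    $space = strpos($text, ' ', $middle);
    if ($space === false) {
        $space = strlen($text);
    }

    return [substr($text, 0, $space), ltrim(substr($text, $space))];
}
```

The word or words at the start of the second element are what the fourth step would then search for in the string that still contains its non-paragraph tags.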

.. actually, just as I got to this point I realised that 90% of what I wrote is pretty darned obvious, and that the last dot-point is precisely where the problem is.
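One way around that last dot-point, sketched here under heavy assumptions, is to skip the string matching and instead walk the markup token by token while keeping a stack of the tags that are currently open. When the split point is reached, the stack says which tags to close in the first half and reopen in the second. The function name splitHtmlAt is hypothetical. The sketch assumes well-nested HTML, single-byte text, and words separated by whitespace, and it counts an entity such as &amp; by its raw length. It still relies on a regex to find tags, so the warning above about malformed input applies in full.

```php
<?php
// Hypothetical helper: splits $html into two well-formed halves at the first
// whitespace at or after $maxLength text bytes. Assumes well-nested tags.
function splitHtmlAt(string $html, int $maxLength): array
{
    // Elements that never have a closing tag, so they are never stacked.
    $void = ['br', 'hr', 'img', 'input', 'meta', 'link', 'wbr'];

    // Split into text and tag tokens; DELIM_CAPTURE keeps the tags.
    $tokens = preg_split('/(<[^>]+>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $first = '';
    $open = [];   // stack of ['name' => ..., 'tag' => ...] for open elements
    $count = 0;   // text bytes copied into $first so far

    foreach ($tokens as $i => $token) {
        if (preg_match('/^<[^>]+>$/', $token)) {
            // Tags are copied verbatim and never counted.
            $first .= $token;
            if (preg_match('/^<\//', $token)) {
                array_pop($open);                               // closing tag
            } elseif (preg_match('/^<([a-z][a-z0-9]*)/i', $token, $m)
                && substr($token, -2) !== '/>'
                && !in_array(strtolower($m[1]), $void, true)) {
                $open[] = ['name' => $m[1], 'tag' => $token];   // opening tag
            }
            continue;
        }

        // Text token: look for whitespace at or after the target offset.
        // Once the target has been passed, the offset stays at 0.
        $offset = max(0, $maxLength - $count);
        if ($offset < strlen($token)
            && preg_match('/\s/', $token, $m, PREG_OFFSET_CAPTURE, $offset)) {
            $cut = $m[0][1];
            $first .= substr($token, 0, $cut);
            $second = ltrim(substr($token, $cut))
                . implode('', array_slice($tokens, $i + 1));

            // Close open tags in the first half, reopen them in the second.
            foreach (array_reverse($open) as $el) {
                $first .= '</' . $el['name'] . '>';
                $second = $el['tag'] . $second;
            }
            return [$first, $second];
        }

        // No word boundary in this token yet: keep it whole and move on.
        $first .= $token;
        $count += strlen($token);
    }

    // The text is shorter than $maxLength, or no whitespace follows it.
    return [$html, ''];
}
```

Calling splitHtmlAt() on the example string from the question with a position of 39 should give the two halves the question asks for. A boundary between two block elements with no whitespace between them is not treated as a word break here; that case would need extra handling, and a real HTML parser such as DOMDocument is likely the more robust route.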

I'm going to leave my half-finished rant here as a warning to others, and to myself.

ebonhand