4

I have some fairly large paragraphs (5000-6000 words) containing text and embedded HTML tags. I want to break each paragraph into chunks of 1500 words, where the count covers only actual words and ignores the HTML markup. Using `strip_tags` I can count the words without the markup, but I can't figure out how to break the paragraph into 1500-word chunks that still include the markup. For example:

This is <b> a </b> paragraph which <a href="#"> has some </a> some text to be broken in <h1> 5 words </h1>.

The result should be

1 = This is <b> a </b> paragraph which
2 = <a href="#"> has some </a> some text to
3 = be broken in <h1> 5 words </h1>. 
prashant
  • How do you plan to handle chunks where an html tag happens over your border? Would it matter if you had an open tag but no close tag in one of your chunks? – glenatron Dec 18 '12 at 14:50
  • @glenatron That is exactly what his question is. He wants to break the paragraphs in smaller bits, but he wants to keep his tags intact. – Jelmer Dec 18 '12 at 14:50
  • @Jelmer I interpret the question to mean that he wants to break it into blocks of a certain number of words and any number of html tags among those words. Whether the markup retains any significance ( or needs to still be semantically correct html ) is relevant to how one would do this. – glenatron Dec 18 '12 at 14:55
  • @prashant, what if you break words in one tag block like `two words`, should it be `1 = two` , `2 = words`? Or just `1 = two` , `2 = words` – Laz Karimov Dec 18 '12 at 14:59
  • @glenatron: Interesting question. Thinking more about the problem, I'll need to keep the closing tag intact in order to have correct markup. I think it would be OK to pull in the text up to the end of the closing tag (if that simplifies the problem), so the actual text can go beyond 1500 words. – prashant Dec 18 '12 at 14:59
  • Have a look at http://jhollingworth.github.com/bootstrap-wysihtml5/: when you type some text and apply `U`, `B` and `I` tags at random (using the buttons), you can see that where they overlap, the tags are closed and reopened inside the nested tag. I think that's a better solution than ending the `P` and reopening it without closing the inner markup. You will certainly get XHTML validation errors if you don't. @glenatron ah, I misunderstood :) – Jelmer Dec 18 '12 at 15:21

3 Answers

2

Think about using the explode() function wisely. Or, better but longer, use a regular expression that matches either a word or a tag together with all the text inside it. You should treat an element and everything inside it as an unbreakable entity. For example, you can write a function that breaks your large paragraph into the following array of entities:

$data = array(
  array( "count" => 2, "text" => "This is "),
  array( "count" => 1, "text" => "<b> a </b>"),
  array( "count" => 2, "text" => " paragraph which"),
  // ... and so on for the rest of the paragraph
);

Then write a loop that builds the smaller paragraphs from the $data array.

Also, it won't always be possible to make a chunk exactly 1500 words long. It can be more or less, because you should not split your HTML tags.
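The idea above could be sketched roughly as follows. This is only an illustration under some assumptions: the function names `countWords`, `tokenize` and `chunkByWords` are made up for this example, plain text is split into single-word entities (rather than multi-word runs as in the `$data` array) so that a border can fall between any two words, and the regular expression only copes with simple markup. It will misbehave on nested tags of the same name or on attribute values containing `>`.

```php
<?php
// Count the words in a fragment, ignoring any markup in it.
// A "word" here is a run of letters/digits, optionally joined by ' or -.
function countWords(string $fragment): int
{
    return (int) preg_match_all(
        '/[\p{L}\p{N}]+(?:[\'-][\p{L}\p{N}]+)*/u',
        strip_tags($fragment)
    );
}

// Break the HTML into entities: a whole element (opening tag up to its
// closing tag) is one unbreakable entity; plain text becomes one entity
// per word. Trailing whitespace stays attached to each entity.
function tokenize(string $html): array
{
    preg_match_all(
        '~<(\w+)\b[^>]*>.*?</\1\s*>\s*|[^<\s]+\s*|\s+|<[^>]*>\s*|<~s',
        $html,
        $m
    );
    return $m[0];
}

// Glue entities back together, starting a new chunk once the current
// one already holds $limit words. A chunk may exceed the limit when an
// element straddles the border, because elements are never split.
function chunkByWords(string $html, int $limit): array
{
    $chunks = [];
    $current = '';
    $words = 0;
    foreach (tokenize($html) as $token) {
        $n = countWords($token);
        if ($n > 0 && $words >= $limit) {
            $chunks[] = trim($current);
            $current = '';
            $words = 0;
        }
        $current .= $token;
        $words += $n;
    }
    if (trim($current) !== '') {
        $chunks[] = trim($current);
    }
    return $chunks;
}

// Usage with the paragraph from the question and a limit of 5 words.
$html = 'This is <b> a </b> paragraph which <a href="#"> has some </a> '
      . 'some text to be broken in <h1> 5 words </h1>.';
print_r(chunkByWords($html, 5));
```

With the question's example and a limit of 5, this should give the three chunks listed in the question. For real-world markup a parser is the safer route; the regular expression here is only good enough for flat, well-behaved input.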

Serge Kuharev
  • **Do not** parse HTML with regular expressions http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – dualed Dec 18 '12 at 16:39
1

I think you're going to need to parse your html if you want to guarantee valid markup. In which case this question should provide a really useful starting point.

glenatron
0

Use an XML DOM Parser or an HTML DOM Parser.

  • Iterate over all nodes
  • Count words for each node
  • If the word count exceeds N
    • create new node of parent type
    • insert that as sibling after parent
    • move current and all subsequent siblings to it.
  • move to next element
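A minimal sketch of these steps with PHP's DOM extension might look like the following. The function names `countWords` and `splitParagraph` are hypothetical, and the sketch assumes the element to split has already been parsed (here with `loadXML` on a well-formed snippet). It only splits between child nodes, so a single long text node is left whole. Splitting text nodes would be an extra step.

```php
<?php
// Count words in plain text (runs of letters/digits, joined by ' or -).
function countWords(string $text): int
{
    return (int) preg_match_all(
        '/[\p{L}\p{N}]+(?:[\'-][\p{L}\p{N}]+)*/u',
        $text
    );
}

// Walk the children of $p; when adding the next child would push the
// running word count past $limit, create a new element of the same
// type right after $p and move that child and all following siblings
// into it. Then carry on inside the new element.
function splitParagraph(DOMElement $p, int $limit): void
{
    $words = 0;
    $node = $p->firstChild;
    while ($node !== null) {
        // Only text and element nodes contribute words (comments do not).
        $n = ($node instanceof DOMText || $node instanceof DOMElement)
            ? countWords($node->textContent)
            : 0;

        if ($words > 0 && $words + $n > $limit) {
            // New node of the parent's type; attributes are not copied.
            $next = $p->ownerDocument->createElement($p->nodeName);
            $p->parentNode->insertBefore($next, $p->nextSibling);

            // Move the current node and all subsequent siblings.
            while ($node !== null) {
                $following = $node->nextSibling;
                $next->appendChild($node);
                $node = $following;
            }

            // Continue counting from the start of the new element.
            $p = $next;
            $node = $p->firstChild;
            $words = 0;
            continue;
        }

        $words += $n;
        $node = $node->nextSibling;
    }
}

// Usage: split one <p> into several, at most 3 words each where possible.
$doc = new DOMDocument();
$doc->loadXML('<div><p>one two <b>three</b> four five <i>six seven</i> eight</p></div>');
splitParagraph($doc->getElementsByTagName('p')->item(0), 3);
echo $doc->saveXML($doc->documentElement), "\n";
```

Because whole child nodes are moved, every inline element stays intact and the markup remains well-formed. The price is that chunk sizes are only approximate.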
dualed
  • @prashant If it's a text node you can split it quite safely, and here you *can* use regex, split and whatnot. – dualed Dec 18 '12 at 17:34