How do I split an HTML string into shorter HTML strings in Python? (added some interesting stuff)

Question

I'm having a really hard time with this one,

EDIT: I'm putting this edit at the top. If anyone wants to read the full problem below, you are very welcome. I have started to solve this really hard issue but ran into a new problem. The approach I thought of is to return the whole long HTML page divided by paragraphs ("p" tags). Up to here everything works: when I do assert False, I get everything as I want it. Then in the template I loop over the list I sent in the response, and for each value (a paragraph) I create a div (a page in the book), for now. Here is the problem: I am getting every paragraph three times! Code below...

Output of the assert (part of it):
<p style="text-align: center;">
<span style="font-size:24px;"><strong><u>The Ten Foot Stop</u></strong></span></p>,
<p  style="margin-bottom: 0.2in; text-align: center;">
<span style="font-size:18px;"><font style="font-size: 7pt;">NEWS AND OCCASIONAL ITEMS 
ABOUT THE MEDICAL ASPECTS OF SCUBA DIVING.<br />
POSTED BY ERN CAMPBELL, MD</font></span></p>

template:
{% for article_page in article_pages %}
    {% if article_page %} <!-- don't show an empty paragraph -->
       {{ article_page|safe }}
    {% endif %}
{% endfor %}

This is what shows on the page:
[The Ten Foot Stop, The Ten Foot Stop, The Ten Foot Stop]
<!-- first paragraph has: The Ten Foot Stop -->

From here on is my original post with the full description of the issue: I have a very long HTML-like string (no head or body, but it has tags and styles, img tags and everything else in it) and I need to split it into smaller strings by number of words. The strings need to fit into divs of a certain size, let's say every 165 words more or less, or even better fit a certain height so they match the div size, but I think the second is much more complicated.

The problem I am having, after trying everything including BeautifulSoup and other methods, is that I can't find a way to split the string while keeping the tags intact. If I have a style tag, for example, and the tag starts at character 160 and goes to character 170, the second page (div) will treat the styles as a regular string. As far as I saw, BeautifulSoup only closes "bad" tags; it doesn't reopen the tags for the "bad" text in the second, third and following divs.

I thought about using truncate_html_words from text.py, but as the name implies, this only truncates words; it doesn't save the rest of the text for the next page (or am I wrong?).

Does anyone have an idea how to do this?

OK, I'm starting to figure this out slowly. I will publish it when it is done; I think people need this kind of thing. Next step: I broke the HTML string up by tags (in my case every HTML "p" tag). Now how do I count the text, and only the text, in the tag? (P.S. The tag might have child tags that wrap the text, and might have multiple child tags, e.g.:

a
bcd

this needs to return a count of only 2: two words in the tag)?
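One way to count only the visible words, sketched here with the standard library's html.parser rather than BeautifulSoup. The names TextCollector and count_words are made up for this illustration. It assumes the fragment has no script or style elements, and it counts text from adjacent child tags as separate words only when there is whitespace between them:

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes only; tags and their attributes are ignored."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        # Called for text between tags, however deeply the tags are nested.
        self.parts.append(data)


def count_words(html_fragment):
    """Return the number of whitespace-separated words in the visible text."""
    collector = TextCollector()
    collector.feed(html_fragment)
    collector.close()
    return len("".join(collector.parts).split())


print(count_words('<p style="text-align: center;"><strong>a</strong> <em>bcd</em></p>'))
```

Joining the parts with a space instead of an empty string would count the text of every child tag separately, even without whitespace between the tags.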

10x, Erez

Do you want to split by displayed words? I guess all the style text shouldn't be part of the counted words, right? — solarc, Jun 04 '11 at 16:26
Sorry, I don't think I understand the question... I want to split every 200 words (let's say 200) but without splitting in the middle of an HTML tag. — Erez, Jun 04 '11 at 16:28
I didn't see the end of the question there. No tags will be shown when I render it as HTML on the site, but I get it as a string with HTML tags in it. I don't want to count the tags, no, just the words. 10x :-) — Erez, Jun 04 '11 at 16:43
Could you describe what you mean by "fitting into divs of certain sizes"? There might be other ways to achieve your mission. — Shaung, Jun 04 '11 at 17:25
As part of the site I am building a book-like app (or newspaper); every page is a div. So I have to take the huge HTML string that someone entered in CKEditor and divide it to fit a div size (500 * 700) in my book app. CKEditor creates HTML tags for the text style and lets you insert images and other things, so the text has lots of tags in it and the tags have to stay intact. — Erez, Jun 04 '11 at 17:39

score 1 · Accepted Answer · answered Jun 04 '11 at 21:42

Try starting small: define for yourself some sane, limited number of cases that you want to handle (like break on <p> tags, just show alt strings in place of images, and no divs), and see how that works. Then see if you want to tackle image sizing, or just show a hotspot for the user to select to see the image. Then the biggie is detecting divs. Start with just unnested divs, and get things working so that as you break up <p>s, you carry forward the current div's formatting. Then add nesting with a stack of formatting directives, pushing and popping off the stack as you encounter <div> and </div> tags.
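A rough sketch of that stack idea, using only the standard library's html.parser. HtmlChunker and split_html are names invented for this example, not an existing API. It assumes well-formed input with no script or style elements, drops comments, and decodes entities before re-escaping them (so &nbsp; comes out as a literal non-breaking space). When a chunk reaches max_words, every open tag is closed and then reopened, attributes included, at the start of the next chunk:

```python
import re
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HtmlChunker(HTMLParser):
    def __init__(self, max_words):
        super().__init__(convert_charrefs=True)
        self.max_words = max_words
        self.chunks = []
        self.current = []    # pieces of the chunk being built
        self.words = 0       # visible words in the current chunk
        self.open_tags = []  # stack of (tag name, original start-tag text)

    def _start_new_chunk(self):
        # Close whatever is open, innermost first, and finish the chunk.
        for name, _ in reversed(self.open_tags):
            self.current.append(f"</{name}>")
        self.chunks.append("".join(self.current))
        # Reopen the same tags, outermost first, in the next chunk.
        self.current = [text for _, text in self.open_tags]
        self.words = 0

    def handle_starttag(self, tag, attrs):
        text = self.get_starttag_text()
        self.current.append(text)
        if tag not in VOID_TAGS:
            self.open_tags.append((tag, text))

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br />: emit it, nothing to push.
        self.current.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only well-nested input is handled; stray end tags are ignored.
        if self.open_tags and self.open_tags[-1][0] == tag:
            self.open_tags.pop()
            self.current.append(f"</{tag}>")

    def handle_data(self, data):
        # Walk alternating runs of whitespace and words.
        for token in re.findall(r"\s+|\S+", data):
            if not token.isspace():
                if self.words >= self.max_words:
                    self._start_new_chunk()
                self.words += 1
            self.current.append(escape(token, quote=False))


def split_html(html_fragment, max_words):
    """Split a fragment into chunks of at most max_words visible words."""
    chunker = HtmlChunker(max_words)
    chunker.feed(html_fragment)
    chunker.close()
    if chunker.current:
        chunker.chunks.append("".join(chunker.current))
    return chunker.chunks


for page in split_html('<div class="page"><p>one <b>two three</b> four</p></div>', 2):
    print(page)
```

Two known rough edges of this sketch: a break that falls exactly on an element boundary leaves an empty element such as <p></p> at the end of a chunk, and a word split across tags (for example <b>wo</b>rd) is counted as two words.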

But while your beginnings are simple, I would not be surprised if before long you find you are on the way to developing a complete browser:

repagination of text within screen size constraints
must handle modal style and formatting tags
must handle embedded images of varying size, presumably wrapping text around them

You didn't mention needing support for tables. If anchor tags with hrefs are defined, are these supposed to act as clickable hotspots? And God help you if you have to do something meaningful with JavaScript.

While you are carving off your simple starting point, see just how broad the end product requirements/expectations will have to be. If you start adding tables, frames, fonts, complex style directives, then you are essentially reinventing the web browser. At that point, try to inject some sanity back into the discussion - you are just one person and writing a browser is not a weekend task. Try to get the requirements down to a constrained set of supported tags. Alternatively, look into publicly available/open source browser engines (such as Chromium), which you might be able to adapt, especially in light of your simplified subset of features.

Hey Paul, reading your answer I didn't know whether to laugh or cry. I have been working on this problem for a few days now and knew it was a massive job because I didn't get anywhere meaningful with it; as you said, it is like inventing the wheel all over again. The problem is that I don't know what the article will contain: it might be a table, an anchor, even an image bigger than the resulting book page. You know how sometimes your designer gives the client a great/crazy idea and then you have to figure it out, and no one understands the meaning of "impossible within the time frame or budget"... — Erez, Jun 04 '11 at 22:14
This is where I stand right now: a huge problem that I need to solve. I still can't really see how to do it, but I'm not giving up. I'm actually at the stage of thinking about tweaking CKEditor to divide the page while the user is entering the article. I don't even know if the CKEditor license permits that; I need to see if it is possible, but I think it will be much easier than working on the text after it is already saved as HTML in the DB and trying to parse and split it then. — Erez, Jun 04 '11 at 22:17
@Erez I think it's difficult, since each tag might have its own styles specified, and you would be calculating all the sizes, positions and so on, and it's hard to get right. Is it worth the effort? I would suggest just putting each paragraph in a div and letting the CSS do the rest: bad-looking but rather quick. Or you need to restrict the user input to some kind of predefined structure to get more control over the layout. — Shaung, Jun 05 '11 at 13:40

score 0 · Answer 2 · edited Dec 09 '21 at 09:21

I see you are trying to split while keeping the HTML tags intact. I was simply looking for a way to split a very long HTML string every n characters and write the smaller strings to a .txt file, one per line. In my application I then use these smaller strings to send chunks of a webpage from server to client. I have posted my working script here: https://stackoverflow.com/a/70287092/13795525
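The linked script is not reproduced here. As a minimal sketch of the same idea, with split_every and write_chunks as hypothetical names chosen for this example, fixed-size slicing could look like this. Note that it cuts straight through tags, so the chunks are only meaningful once joined back together:

```python
def split_every(text, n):
    """Return consecutive slices of text, each at most n characters long."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def write_chunks(text, n, path):
    """Write the chunks to path, one chunk per line."""
    with open(path, "w", encoding="utf-8") as f:
        for chunk in split_every(text, n):
            # Assumption: newlines inside the HTML may be replaced by spaces,
            # so that each chunk stays on a single line of the file.
            f.write(chunk.replace("\n", " ") + "\n")


print(split_every("<p>hello</p>", 5))
```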
