Reformat line/div extracted from html

Question

I am currently having trouble reformatting a div that I extracted from a website.

This is what I currently have:

<div class=" frame frame-default frame-type-textmedia frame-layout-0" id="c47903"><a id="c47904"/><div class="ce-textpic ce-left ce-above"><div class="ce-bodytext"><p>The latest data of the evolution of COVID-19 over the past 24hours <strong>in Québec</strong> reveal:</p><ul><li>87new cases, bringing the total number of infected persons to61,004;</li><li>no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718;</li><li>the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;</li><li>18,596tests were performed on August12, for a cumulative total of1,428,286.</li></ul></div></div></div>

but I would like to have something similar to this:

The latest data of the evolution of COVID-19 over the past 24hours in Québecreveal: 87new cases, bringing the total number of infected persons to61,004; no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718; the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;18,596tests were performed on August12, for a cumulative total of1,428,286.

I removed the tags manually, but is there a less time-consuming way to do this?

Does this answer your question? [Extracting text from HTML file using Python](https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) — TescoOne, Aug 14 '20 at 15:27
@TheLazyScripter Just a bunch of str(div).replace("what I want gone","") — paul lacher, Aug 14 '20 at 15:27
If you've captured the tag with `bs4`, try using `get_text()` on the tag — TheLazyScripter, Aug 14 '20 at 15:28

score 0 · Answer 1 · answered Aug 14 '20 at 15:28

Try something like:

soup.select_one('div[class="ce-bodytext"]').text.strip()

That should get you your expected output.
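
For reference, here is a minimal sketch of the `get_text()` route, assuming BeautifulSoup 4 is installed (`pip install beautifulsoup4`). The markup is a shortened, hypothetical stand-in for the scraped div. Passing a separator together with `strip=True` should keep the text of neighbouring tags from running together:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the scraped markup
html_doc = (
    '<div class="ce-textpic"><div class="ce-bodytext">'
    "<p>The latest data <strong>in Québec</strong> reveal:</p>"
    "<ul><li>87 new cases;</li><li>no deaths.</li></ul>"
    "</div></div>"
)

soup = BeautifulSoup(html_doc, "html.parser")

# Select the inner div by its CSS class
body = soup.select_one("div.ce-bodytext")

# Join the text fragments with a single space and strip each one
text = body.get_text(" ", strip=True)
print(text)
```

The `div.ce-bodytext` selector matches on the class name, so it would still work if the div carried additional classes.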

Jack Fleeting


score 0 · Accepted Answer · answered Aug 14 '20 at 15:28

Try this, which uses a regular expression to replace every tag with a space:

text = r'<div class=" frame frame-default frame-type-textmedia frame-layout-0" id="c47903"><a id="c47904"/><div class="ce-textpic ce-left ce-above"><div class="ce-bodytext"><p>The latest data of the evolution of COVID-19 over the past 24hours <strong>in Québec</strong> reveal:</p><ul><li>87new cases, bringing the total number of infected persons to61,004;</li><li>no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718;</li><li>the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;</li><li>18,596tests were performed on August12, for a cumulative total of1,428,286.</li></ul></div></div></div>'
import re
# Replace every HTML tag with a single space
print(re.sub(r'<[^<>]*>', ' ', text))
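
Because each tag becomes a space, the result contains runs of spaces wherever tags were adjacent. A possible follow-up, sketched here with a hypothetical helper `strip_tags` and a shortened sample string, is to collapse the whitespace and decode HTML entities using only the standard library:

```python
import html
import re


def strip_tags(markup: str) -> str:
    """Replace each tag with a space, decode entities, collapse whitespace."""
    no_tags = re.sub(r"<[^<>]*>", " ", markup)
    decoded = html.unescape(no_tags)
    # split() with no arguments drops leading/trailing and repeated whitespace
    return " ".join(decoded.split())


# Shortened, hypothetical sample; &eacute; stands in for an encoded "é"
sample = (
    '<div class="ce-bodytext"><p>Data <strong>in Qu&eacute;bec</strong> reveal:</p>'
    "<ul><li>87 new cases;</li><li>no deaths.</li></ul></div>"
)
print(strip_tags(sample))
```

Keep in mind that a regular expression is not an HTML parser: comments, script blocks, or a `>` inside an attribute value can trip it up, so this is best kept for small, well-behaved snippets like the one in the question.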

You can try `f"{variableforthevalue}"` in Python 3; for Python 2 you have to use `"%s" % (variableforthevalue)` — Kuldip Chaudhari, Aug 14 '20 at 15:45

score 0 · Answer 3 · answered Aug 14 '20 at 15:34

Try:

str(bs4_obj.select('div')[0].text)

I don't know how to convert it from Unicode, but it gets rid of the HTML tags.
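
On the Unicode point: in Python 3 the extracted text is already a regular `str`, so no conversion should be needed for printing or string operations. Encoding only matters when bytes are required, for example when writing to a binary file. A small sketch, with a hypothetical literal standing in for the extracted text:

```python
# Hypothetical stand-in for the string returned by .text
extracted = "in Québec reveal:"

# In Python 3 this is already a Unicode str
print(type(extracted).__name__)

# Encode explicitly only when bytes are needed
encoded = extracted.encode("utf-8")
print(encoded)

# Decoding with the same codec restores the original string
print(encoded.decode("utf-8") == extracted)
```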

Hadrian
