I am having trouble reformatting a div that I extracted from a website.

This is what I currently have:

<div class=" frame frame-default frame-type-textmedia frame-layout-0" id="c47903"><a id="c47904"/><div class="ce-textpic ce-left ce-above"><div class="ce-bodytext"><p>The latest data of the evolution of COVID-19 over the past 24hours <strong>in Québec</strong> reveal:</p><ul><li>87new cases, bringing the total number of infected persons to61,004;</li><li>no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718;</li><li>the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;</li><li>18,596tests were performed on August12, for a cumulative total of1,428,286.</li></ul></div></div></div> 

but I would like to have something similar to this:

The latest data of the evolution of COVID-19 over the past 24hours in Québecreveal: 87new cases, bringing the total number of infected persons to61,004; no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718; the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;18,596tests were performed on August12, for a cumulative total of1,428,286.

I removed the tags manually, but is there a less time-consuming way to do this?

3 Answers

Try something like:

soup.select_one('div[class="ce-bodytext"]').text.strip()

That should get you your expected output.
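
As a rough, self-contained sketch of the same idea (assuming BeautifulSoup 4 is installed; the helper name `extract_text` and the shortened sample markup are made up for illustration), `get_text` with a separator keeps words from adjacent tags from running together:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative stand-in for the div in the question.
markup = (
    '<div class="ce-bodytext"><p>The latest data <strong>in Québec</strong> reveal:</p>'
    '<ul><li>87 new cases;</li><li>no deaths.</li></ul></div>'
)

def extract_text(markup: str, selector: str = "div.ce-bodytext") -> str:
    """Return the visible text of the first element matching `selector`."""
    soup = BeautifulSoup(markup, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        # Nothing matched the selector; return an empty string instead of failing.
        return ""
    # Strip each text fragment and join the fragments with a single space.
    return node.get_text(" ", strip=True)

print(extract_text(markup))
```

One trade-off of the space separator: punctuation that directly follows an inline tag such as `<strong>` ends up with a space in front of it.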

Jack Fleeting

Try this

text = r'<div class=" frame frame-default frame-type-textmedia frame-layout-0" id="c47903"><a id="c47904"/><div class="ce-textpic ce-left ce-above"><div class="ce-bodytext"><p>The latest data of the evolution of COVID-19 over the past 24hours <strong>in Québec</strong> reveal:</p><ul><li>87new cases, bringing the total number of infected persons to61,004;</li><li>no deaths have occurred in the past 24hours, to which are added 3deaths which occurred between August7 and12, for a total of5,718;</li><li>the number of hospitalizations increased by2 compared to the previous day, for a cumulative total of151. Of these, 25were in intensive care, an increase of2;</li><li>18,596tests were performed on August12, for a cumulative total of1,428,286.</li></ul></div></div></div>'
import re
print(re.sub(r'<[^<>]*>', ' ', text))
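
Replacing every tag with a space leaves runs of spaces, and entities such as `&nbsp;` stay encoded. A possible cleanup, using only the standard library, might look like the sketch below (the function name `strip_tags` and the shortened sample string are illustrative assumptions). Regex-based stripping remains fragile for markup containing comments, scripts, or `>` inside attribute values.

```python
import html
import re

def strip_tags(markup: str) -> str:
    """Remove tags with a regex, decode entities, and collapse whitespace."""
    # Replace every tag with a space so neighbouring words do not merge.
    no_tags = re.sub(r"<[^<>]*>", " ", markup)
    # Decode entities such as &nbsp; and &amp; into their characters.
    decoded = html.unescape(no_tags)
    # Collapse whitespace runs (including non-breaking spaces) into one space.
    return re.sub(r"\s+", " ", decoded).strip()

# Shortened, illustrative sample based on the list items in the question.
sample = "<ul><li>87&nbsp;new cases;</li><li>18,596&nbsp;tests were performed.</li></ul>"
print(strip_tags(sample))
```
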
Kuldip Chaudhari

Try:

str(bs4_obj.select('div')[0].text)

I don't know how to convert it from Unicode, but it gets rid of the HTML tags.

Hadrian