I have a legacy webpage to scrape using BS4. One of the sections is a long essay that I need to extract. That essay is formatted strangely, like this:
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
Using bs4, I tried the following. With
soup.find('div', id='essay').text
I can extract
'this is paragraph1' and 'this is paragraph3'
OR
ps = soup.find('div', id='essay').find_all('p')
for p in ps:
    print(p.text)
I can extract
'this is paragraph2' and 'this is paragraph4'
If I use both, I get paragraphs 1, 3, 2, 4, which is out of order. I need the paragraph sequence to be preserved. How can I achieve that?
EDIT: The HTML in the question is only an example; the paragraphs are not guaranteed to alternate between bare text and <p> tags. To clarify: I want a way to extract the paragraphs IN SEQUENCE, regardless of whether they are wrapped in <p> or not.
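For illustration, here is a minimal sketch of the kind of result I am after. It walks the direct children of the div in document order and treats every bare string and every tag as one paragraph. The helper name `paragraphs_in_order` is made up for this example, and it assumes that each paragraph is a direct child of the div, which may not hold for every page.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

html = """<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>"""


def paragraphs_in_order(div):
    """Return the text of each direct child of `div`, in document order.

    Hypothetical helper: assumes every paragraph is a direct child of the div,
    either as bare text or wrapped in a tag such as <p>.
    """
    paragraphs = []
    for child in div.children:
        if isinstance(child, Comment):
            # HTML comments are a NavigableString subclass; skip them.
            continue
        if isinstance(child, NavigableString):
            text = child.strip()             # bare text between tags
        elif isinstance(child, Tag):
            text = child.get_text().strip()  # text inside <p> or any other tag
        else:
            continue
        if text:                             # drop whitespace-only fragments
            paragraphs.append(text)
    return paragraphs


soup = BeautifulSoup(html, "html.parser")
essay = soup.find("div", id="essay")
result = paragraphs_in_order(essay)
print(result)
```

Because `.children` yields text nodes and tags in the order they appear in the markup, the sequence should be preserved whether or not a paragraph has a <p> wrapper.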