-1

I have a legacy webpage to scrape using BS4. One of the sections is a long essay that I need to extract. The essay is formatted strangely, like this:

<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>

Using bs4, I tried the following. Using

soup.find('div', id='essay').text

I can extract

'this is paragraph1' and 'this is paragraph3'

OR

ps = soup.find('div', id='essay').find_all('p')
for p in ps:
    print(p.text)

I can extract

'this is paragraph2' and 'this is paragraph4'

If I combine both, I get paragraphs 1, 3, 2, 4, which is out of order. I need the paragraphs in their original sequence. How can I achieve that?

EDIT: The HTML above is only an example; the bare-text paragraphs and the <p> paragraphs are not guaranteed to alternate. To clarify: I want a way to extract the paragraphs IN SEQUENCE, regardless of whether they are wrapped in <p> or not.

return 0
  • `soup.find('div', id="essay").text` gets exactly what you want so there must be more to your actual html. That or you are using some old buggy version of bs4 – Padraic Cunningham Jul 04 '16 at 22:02

3 Answers

-1

BeautifulSoup4 also has a recursive mode, which is enabled by default.

from bs4 import BeautifulSoup
html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
r = soup.find('div', id="essay", recursive=True).text
print(r)

Works perfectly for me. Try to update BeautifulSoup4 using pip.
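If a list of paragraphs is needed rather than one block of text, a minimal sketch along the same lines is shown below. It assumes the sample HTML from the question and the standard-library "html.parser" backend; the variable names are only illustrative. It relies on `stripped_strings`, which walks every text node in document order and skips whitespace-only ones.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
essay = soup.find("div", id="essay")

# stripped_strings yields each text node in document order,
# with surrounding whitespace removed and empty strings skipped.
paragraphs = list(essay.stripped_strings)

for paragraph in paragraphs:
    print(paragraph)
```

One caveat: if a paragraph contains inline tags such as <b> or <a>, each piece of text inside them is yielded as a separate string, so this only maps one string to one paragraph for markup as simple as the example.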

O'Niel
-2

If the two lists are the same length, it might be easier to interleave them instead of writing code to work around the original formatting with Beautiful Soup:

from itertools import chain

list_a = ['this is paragraph1', 'this is paragraph3']
list_b = ['this is paragraph2', 'this is paragraph4']

print(list(chain.from_iterable(zip(list_a, list_b))))


# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']

More info here: Interleaving Lists in Python

Jack Evans
-2

The following seems to work:

import bs4

soup = bs4.BeautifulSoup("""
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
""", "lxml")

main = soup.find('div', id='essay')
for child in main.children:
    print(child.string)
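A possible refinement of the loop above, offered as a sketch rather than a definitive fix: `.children` also yields the whitespace-only strings between tags, and `.string` is None for a tag with more than one child. The version below collects the text of each direct child, skips the empty ones, and uses `get_text` for tags. It assumes the question's sample HTML and uses "html.parser" instead of "lxml" only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Sample markup copied from the question.
html = """
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", id="essay")

paragraphs = []
for child in main.children:
    if isinstance(child, Tag):
        # A tag such as <p>: join all of its nested text into one string.
        text = child.get_text(" ", strip=True)
    else:
        # A bare text node directly inside the div.
        text = str(child).strip()
    if text:  # skip whitespace-only nodes between tags
        paragraphs.append(text)

for paragraph in paragraphs:
    print(paragraph)
```

Because each direct child becomes at most one entry, a <p> that contains inline markup still ends up as a single paragraph, and the document order is preserved whether or not the text is wrapped in <p>.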
Jack Evans
  • You have to specify your parser in the function BeautifulSoup. Otherwise you may get unwanted behaviours in other systems, and a big warning. – O'Niel Jul 04 '16 at 22:05