Using BeautifulSoup4 to get string with/without <p> tag inside <div> section

Question

I have a legacy webpage to scrape using BS4. One of the sections is a long essay that I need to scrape. That essay is formatted strangely, like this:

<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>

Using bs4, I tried the following. Using

soup.find('div', id='essay').text

I can extract

'this is paragraph1' and 'this is paragraph3'

OR

ps = soup.find('div', id='essay').find_all('p')
for p in ps:
    print(p.text)

I can extract

'this is paragraph2' and 'this is paragraph4'

If I use both, I get paragraphs 1, 3, 2, 4, which is out of order. I need the paragraph sequence to be correct as well. What can I do to achieve that?

EDIT: The HTML in the question is only an example; the real page is not guaranteed to alternate between bare text and <p> paragraphs. To clarify: I want a way to extract the paragraphs IN SEQUENCE, regardless of whether they are wrapped in <p> or not.

`soup.find('div', id="essay").text` gets exactly what you want, so there must be more to your actual HTML. That, or you are using some old buggy version of bs4 – Padraic Cunningham, Jul 04 '16 at 22:02

O'Niel · Answer 1 · 2016-07-04T22:07:32.040

-1

BeautifulSoup4 also has a recursive mode, which is enabled by default.

from bs4 import BeautifulSoup
html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
r = soup.find('div', id="essay", recursive=True).text
print(r)

Works perfectly for me. Try updating BeautifulSoup4 with pip.
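If the goal is a list of paragraphs rather than one long string, a minimal sketch along the same lines is shown below. It assumes the sample HTML from the question and uses the `stripped_strings` generator, which walks the text nodes under a tag in document order. The variable names `essay` and `paragraphs` are only illustrative.

```python
from bs4 import BeautifulSoup

html = """
<div id='essay'>
  this is paragraph1
  <p>this is paragraph2</p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
essay = soup.find("div", id="essay")

# stripped_strings yields each text node under the div in document order,
# with surrounding whitespace removed and whitespace-only nodes skipped.
paragraphs = list(essay.stripped_strings)
print(paragraphs)
```

One caveat: a <p> that itself contains inline tags (for example <b>) would be split into several strings, so this only maps one-to-one to paragraphs when each paragraph is a single text node.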

edited Jul 04 '16 at 22:07

answered Jul 04 '16 at 21:57

O'Niel


The default is `recursive=True` – Padraic Cunningham Jul 04 '16 at 21:59
@PadraicCunningham Thanks, I've edited that. – O'Niel Jul 04 '16 at 22:02
My point is that `soup.find('div', id="essay").text` works and is the same as `soup.find('div', id="essay", recursive=True).text` but the OP already stated that it did not for them – Padraic Cunningham Jul 04 '16 at 22:03
OP should then post his full HTML, and update bs4 using pip... – O'Niel Jul 04 '16 at 22:09

score -2 · Answer 2 · edited May 23 '17 at 11:58

-2

If the lists are the same length, it might be easier to interleave them instead of writing code to get around the original formatting with Beautiful Soup:

from itertools import chain

list_a = ['this is paragraph1', 'this is paragraph3']
list_b = ['this is paragraph2', 'this is paragraph4']

print(list(chain.from_iterable(zip(list_a, list_b))))
# ['this is paragraph1', 'this is paragraph2', 'this is paragraph3', 'this is paragraph4']

More info here: Interleaving Lists in Python
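The equal-length condition matters because `zip` stops at the shorter list. As a rough sketch, with a made-up fifth paragraph added to `list_a`, `itertools.zip_longest` can keep the leftover item. The `_MISSING` sentinel is a hypothetical helper name. Note that this still assumes the two kinds of paragraph strictly alternate, which the question's edit says is not guaranteed.

```python
from itertools import chain, zip_longest

# Assumed sample data: list_a has one more item than list_b.
list_a = ['this is paragraph1', 'this is paragraph3', 'this is paragraph5']
list_b = ['this is paragraph2', 'this is paragraph4']

# zip() stops at the shorter list, so the last item of list_a is dropped.
truncated = list(chain.from_iterable(zip(list_a, list_b)))

# zip_longest() pads the shorter list with a sentinel, filtered out afterwards.
_MISSING = object()
interleaved = [
    item
    for item in chain.from_iterable(zip_longest(list_a, list_b, fillvalue=_MISSING))
    if item is not _MISSING
]

print(truncated)
print(interleaved)
```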

edited May 23 '17 at 11:58

Community


answered Jul 04 '16 at 21:43

Jack Evans


Thank you for your in-depth answer, please check out my edit, I added some newer information to the question. Sorry for the confusion. – return 0 Jul 04 '16 at 21:47
I totally missed that question! Thank you so much! – return 0 Jul 04 '16 at 21:50

Jack Evans · Answer 3 · 2016-07-05T08:15:41.583

-2

The following seems to work:

import bs4

soup = bs4.BeautifulSoup("""
<div id='essay'>
this is paragraph1
<p>this is paragraph2</p>
this is paragraph3
<p>this is paragraph4</p>
</div>
""", "lxml")

main = soup.find('div', id='essay')
for child in main.children:
    print(child.string)
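A possible refinement of the loop above, offered as a sketch rather than a definitive fix: `.children` also yields whitespace-only text nodes, and `.string` returns None for a tag that has more than one child (such as a <p> containing a <b>). The version below assumes a slightly modified sample with a nested <b>, collects the results into a list named `paragraphs` (an illustrative name), and uses `get_text()` for tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Assumed sample: same as the question, but paragraph2 contains a nested <b>.
html = """
<div id='essay'>
  this is paragraph1
  <p>this is <b>paragraph2</b></p>
  this is paragraph3
  <p>this is paragraph4</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
essay = soup.find("div", id="essay")

paragraphs = []
for child in essay.children:
    if isinstance(child, Tag):
        # get_text() flattens nested markup, where .string would give None.
        text = child.get_text(" ", strip=True)
    else:
        # Bare text node directly under the div.
        text = str(child).strip()
    if text:  # skip whitespace-only nodes between elements
        paragraphs.append(text)

print(paragraphs)
```

Because the direct children are visited in document order, bare text and <p> elements come out in sequence whether or not they alternate.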

edited Jul 05 '16 at 08:15

answered Jul 04 '16 at 21:52

Jack Evans


You have to pass a parser to the BeautifulSoup constructor. Otherwise you may get different behaviour on other systems, plus a big warning. – O'Niel Jul 04 '16 at 22:05
