
I am trying to parse all elements under a div using BeautifulSoup. The issue is that I don't know all the elements underneath the div prior to parsing. For example, a div can have text data in paragraphs and bullet points along with some href elements. Each URL that I open can have different elements underneath the specific div class that I am looking at:

example:

url a can have following:

<div class='content'>
<p> Hello I have a link </p>

<li> I have a bullet point

<a href="foo.com">foo</a>
</div>

but url b

can have

<div class='content'>
<p> I only have paragraph </p>

</div>

I started as doing something like this:

content = souping_page.body.find('div', attrs={'class': 'content'})

but I'm a little confused about how to go beyond this. I was hoping to create one string from all the parsed data as an end result.

In the end, I want to obtain the following string from each example:

Example 1: Final Output

 parse_data = Hello I have a link I have a bullet point 
 parse_links = foo.com

Example 2: Final Output

 parse_data = I only have paragraph  
add-semi-colons

2 Answers


You can get just the text of an element with element.get_text():

>>> from bs4 import BeautifulSoup
>>> sample1 = BeautifulSoup('''\
... <div class='content'>
... <p> Hello I have a link </p>
... 
... <li> I have a bullet point
... 
... <a href="foo.com">foo</a>
... </div>
... ''').find('div')
>>> sample2 = BeautifulSoup('''\
... <div class='content'>
... <p> I only have paragraph </p>
... 
... </div>
... ''').find('div')
>>> sample1.get_text()
u'\n Hello I have a link \n I have a bullet point\n\nfoo\n'
>>> sample2.get_text()
u'\n I only have paragraph \n'

or you can strip it down a little using element.stripped_strings:

>>> ' '.join(sample1.stripped_strings)
u'Hello I have a link I have a bullet point foo'
>>> ' '.join(sample2.stripped_strings)
u'I only have paragraph'

To get all links, look for all a elements with href attributes and gather these in a list:

>>> [a['href'] for a in sample1.find_all('a', href=True)]
['foo.com']
>>> [a['href'] for a in sample2.find_all('a', href=True)]
[]

The href=True argument limits the search to <a> tags that have a href attribute defined.
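Putting those two pieces together, a small helper could produce exactly the parse_data and parse_links values from the question. The function name extract_content below is mine, not from the question; this is just a sketch of how the get_text/stripped_strings and find_all approaches combine:

```python
from bs4 import BeautifulSoup


def extract_content(html):
    """Return (text, links) for the first div with class 'content'."""
    div = BeautifulSoup(html, 'html.parser').find(
        'div', attrs={'class': 'content'})
    # Join the whitespace-stripped text fragments into one string
    parse_data = ' '.join(div.stripped_strings)
    # Collect the href of every <a> tag that actually has one
    parse_links = [a['href'] for a in div.find_all('a', href=True)]
    return parse_data, parse_links


text, links = extract_content('''
<div class='content'>
<p> Hello I have a link </p>

<li> I have a bullet point

<a href="foo.com">foo</a>
</div>
''')
# text  -> 'Hello I have a link I have a bullet point foo'
# links -> ['foo.com']
```

The same call on the second sample would give 'I only have paragraph' and an empty list, matching the desired output for both examples.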

Martijn Pieters
1

Per the Beautiful Soup docs, to iterate over the children of a tag use either .contents to get them as a list or .children (a generator).

for child in title_tag.children:
    print(child)

So, in your case, you can grab the .text of each tag and concatenate the results together. I'm not clear on whether you want the link location or simply the label; if the former, refer to this SO question.
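As a sketch of that approach (variable names are illustrative, and I use .descendants rather than .children so that tags nested inside other tags, such as a link inside a bullet, are also visited):

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<div class='content'>
<p> Hello I have a link </p>
<li> I have a bullet point
<a href="foo.com">foo</a>
</div>'''

div = BeautifulSoup(html, 'html.parser').find(
    'div', attrs={'class': 'content'})

texts, links = [], []
for node in div.descendants:  # like .children, but recursive
    if isinstance(node, NavigableString):
        # Keep only non-whitespace text fragments
        if node.strip():
            texts.append(node.strip())
    elif node.name == 'a' and node.has_attr('href'):
        links.append(node['href'])

parse_data = ' '.join(texts)
parse_links = links
# parse_data  -> 'Hello I have a link I have a bullet point foo'
# parse_links -> ['foo.com']
```

This gives the same result as the stripped_strings approach in the other answer, but lets you branch on each node if you need per-tag handling.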

Anov