
I am trying to parse all elements under a div using BeautifulSoup. The issue is that I don't know all the elements underneath the div prior to parsing. For example, a div can have text data in paragraphs and bullet points along with some href elements. Each URL that I open can have different elements underneath the specific div class that I am looking at:

example:

url a can have following:

<div class='content'>
<p> Hello I have a link </p>

<li> I have a bullet point

<a href="foo.com">foo</a>
</div>

but url b

can have

<div class='content'>
<p> I only have paragraph </p>

</div>

I started as doing something like this:

content = souping_page.body.find('div', attrs={'class': 'content'})

but I'm a little confused about how to go beyond this. I was hoping to create one string from all the parsed data as an end result.

In the end, I want to obtain the following string from each example:

Example 1: Final Output

 parse_data = Hello I have a link I have a bullet point 
 parse_links = foo.com

Example 2: Final Output

 parse_data = I only have paragraph  
add-semi-colons

2 Answers


You can get just the text of an element with element.get_text():

>>> from bs4 import BeautifulSoup
>>> sample1 = BeautifulSoup('''\
... <div class='content'>
... <p> Hello I have a link </p>
... 
... <li> I have a bullet point
... 
... <a href="foo.com">foo</a>
... </div>
... ''').find('div')
>>> sample2 = BeautifulSoup('''\
... <div class='content'>
... <p> I only have paragraph </p>
... 
... </div>
... ''').find('div')
>>> sample1.get_text()
u'\n Hello I have a link \n I have a bullet point\n\nfoo\n'
>>> sample2.get_text()
u'\n I only have paragraph \n'

or you can strip it down a little using element.stripped_strings:

>>> ' '.join(sample1.stripped_strings)
u'Hello I have a link I have a bullet point foo'
>>> ' '.join(sample2.stripped_strings)
u'I only have paragraph'

To get all links, look for all a elements with href attributes and gather these in a list:

>>> [a['href'] for a in sample1.find_all('a', href=True)]
['foo.com']
>>> [a['href'] for a in sample2.find_all('a', href=True)]
[]

The href=True argument limits the search to <a> tags that have a href attribute defined.
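Putting those two pieces together, a small helper could produce exactly the parse_data and parse_links values from the question. The function name extract_content below is mine, not from the question; this is just a sketch of how the get_text/stripped_strings and find_all approaches combine:

```python
from bs4 import BeautifulSoup


def extract_content(html):
    """Return (text, links) for the first div with class 'content'."""
    div = BeautifulSoup(html, 'html.parser').find(
        'div', attrs={'class': 'content'})
    # Join the whitespace-stripped text fragments into one string
    parse_data = ' '.join(div.stripped_strings)
    # Collect the href of every <a> tag that actually has one
    parse_links = [a['href'] for a in div.find_all('a', href=True)]
    return parse_data, parse_links


text, links = extract_content('''
<div class='content'>
<p> Hello I have a link </p>

<li> I have a bullet point

<a href="foo.com">foo</a>
</div>
''')
# text  -> 'Hello I have a link I have a bullet point foo'
# links -> ['foo.com']
```

The same call on the second sample would give 'I only have paragraph' and an empty list, matching the desired output for both examples.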

Martijn Pieters
1

Per the Beautiful Soup docs, to iterate over the children of a tag use either .contents to get them as a list or .children (a generator).

for child in title_tag.children:
    print(child)

So, in your case, you can grab the .text of each tag and concatenate the results together. I'm not clear on whether you want the link location or simply the label; if the former, refer to this SO question.
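As a sketch of that approach (variable names are illustrative, and I use .descendants rather than .children so that tags nested inside other tags, such as a link inside a bullet, are also visited):

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<div class='content'>
<p> Hello I have a link </p>
<li> I have a bullet point
<a href="foo.com">foo</a>
</div>'''

div = BeautifulSoup(html, 'html.parser').find(
    'div', attrs={'class': 'content'})

texts, links = [], []
for node in div.descendants:  # like .children, but recursive
    if isinstance(node, NavigableString):
        # Keep only non-whitespace text fragments
        if node.strip():
            texts.append(node.strip())
    elif node.name == 'a' and node.has_attr('href'):
        links.append(node['href'])

parse_data = ' '.join(texts)
parse_links = links
# parse_data  -> 'Hello I have a link I have a bullet point foo'
# parse_links -> ['foo.com']
```

This gives the same result as the stripped_strings approach in the other answer, but lets you branch on each node if you need per-tag handling.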

Anov