213
soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I'm staring right at it from

soup.prettify()

soup.find("div", { "id" : "articlebody" }) also does not work.

(EDIT: I found that BeautifulSoup wasn't correctly parsing my page, which probably meant the page I was trying to parse isn't properly formatted in SGML or whatever)

smci
Tony Stark
  • (To your EDIT, this question still has value as a reusable resource to others, even if the parser doesn't work on your particular page) – smci Jun 07 '20 at 01:14

13 Answers

298

You should post your example document, because the code works fine:

>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

Finding <div>s inside <div>s works as well:

>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>
Lukáš Lalinský
  • 3
    my example document is enormous. i'm tracking down the problem - i think this doesn't work on divs of divs. I did a count of how many divs are in the document with print len(soup('div')) which resulted in 10, and i can CLEARLY see more than 10 divs with firebug. so i think it just can't find divs inside divs, so i need to narrow things down wrapper by wrapper. – Tony Stark Jan 25 '10 at 22:59
  • 10
    Well, then it's impossible to answer your question, crystal balls are not a reliable way of debugging. :) – Lukáš Lalinský Jan 25 '10 at 23:00
  • 1
    I tried this code. The div has an embed and I can't print the embed inside it. – Vincent Dec 03 '13 at 08:06
  • 26
    or more simply [`div = soup.find(id="articlebody")`](http://stackoverflow.com/a/22410466/4279) – jfs May 05 '14 at 17:39
  • 5
    or `soup.find('div', id='articlebody')` – Trevor Boyd Smith Nov 11 '16 at 21:19
  • 1
    I'd just point out when installing, install `pip install beautifulsoup4` because command without the 4 installs version 3 instead of 4. https://beautiful-soup-4.readthedocs.io/en/latest/#installing-beautiful-soup – Jan Sila Jul 17 '17 at 12:49
125

To find an element by its id:

div = soup.find(id="articlebody")
jfs
46

Beautiful Soup 4 supports most CSS selectors with the .select() method, therefore you can use an id selector such as:

soup.select('#articlebody')

If you need to specify the element's type, you can add a type selector before the id selector:

soup.select('div#articlebody')

The .select() method will return a collection of elements, which means that it would return the same results as the following .find_all() method example:

soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")

If you only want to select a single element, then you could just use the .find() method:

soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
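
As a minimal sketch of the equivalence described above, assuming Beautiful Soup 4 with the built-in `html.parser` (the sample markup and variable names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, just for illustration
html = '<html><body><div id="articlebody"><p>Hello</p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# .select() and .find_all() both return a list-like collection of matches
selected = soup.select("div#articlebody")
found = soup.find_all("div", id="articlebody")

# .find() and .select_one() return only the first match,
# or None when nothing matches
first = soup.find(id="articlebody")
first_css = soup.select_one("#articlebody")
missing = soup.find(id="no-such-id")
```

Because `.find()` returns `None` for a missing element, it is worth checking the result before calling anything on it.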
Josh Crozier
17

I think there is a problem when the 'div' tags are too deeply nested. I am trying to parse some contacts from a Facebook HTML file, and Beautiful Soup is not able to find "div" tags with class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.

The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice, I would really appreciate it.

This is my code, where I just try to print the number of tags "div" with class "fcontent":

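from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2)

with open('/Users/myUserName/Desktop/contacts.html') as f:
    soup = BeautifulSoup(f)

# Named `divs` rather than `list`, which would shadow the built-in
divs = soup.findAll('div', attrs={'class': 'fcontent'})
print len(divs)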
omar
9

Most probably the default Beautiful Soup parser is having trouble with the page. Switch to a different parser, such as 'lxml', and try again.
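
One way to check whether the parser is the culprit is to parse the same markup with each installed parser and compare how many divs each one sees. This is only a sketch, assuming Beautiful Soup 4; the sample markup and the helper name `count_divs` are hypothetical, and `lxml` and `html5lib` are optional packages that may not be installed:

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical, slightly sloppy markup: the <p> is never closed
html = '<html><body><div><p>intro<div id="articlebody">text</div></div></body></html>'

def count_divs(markup):
    """Return {parser name: number of <div> tags found} for each installed parser."""
    counts = {}
    for parser in ("html.parser", "lxml", "html5lib"):
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed, so skip it
        counts[parser] = len(soup.find_all("div"))
    return counts

counts = count_divs(html)
for parser, n in counts.items():
    print(parser, n)
```

If the counts differ between parsers on your real page, the markup is probably malformed and the more lenient parser is the safer choice.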

liang
8

In the Beautiful Soup source, this line allows divs to be nested within divs, so the concern you raised in your comment on Lukáš's answer wouldn't be valid.

NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']

What I think you need to do is to specify the attrs you want such as

source.find('div', attrs={'id':'articlebody'})
dagoof
5

Have you tried soup.findAll("div", {"id": "articlebody"})?

It sounds crazy, but if you're scraping stuff from the wild, you can't rule out multiple divs with the same id...
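
A small sketch of that situation, assuming Beautiful Soup 4 (where `find_all` is the current spelling of `findAll`) and a made-up document that reuses the same id:

```python
from bs4 import BeautifulSoup

# Hypothetical markup in which the same id is (invalidly) used twice
html = """
<div id="articlebody">first</div>
<div id="articlebody">second</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every match, in document order
matches = soup.find_all("div", {"id": "articlebody"})
texts = [div.get_text() for div in matches]

# find() only ever returns the first of them
first = soup.find("div", {"id": "articlebody"})
```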

user106514
5

I used:

soup.findAll('tag', attrs={'attrname':"attrvalue"})

as my syntax for find/findAll. That said, unless there are other optional parameters between the tag and the attribute list, this shouldn't behave any differently.
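
In Beautiful Soup 4, at least, `attrs` is the second positional parameter of `find_all`, so the dictionary can be passed either way. A quick sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup
html = '<div class="a" id="articlebody">x</div><div id="other">y</div>'
soup = BeautifulSoup(html, "html.parser")

# Three spellings of the same query
by_keyword = soup.find_all("div", attrs={"id": "articlebody"})
by_position = soup.find_all("div", {"id": "articlebody"})
by_kwarg = soup.find_all("div", id="articlebody")
```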

4

Here is a code fragment

soup = BeautifulSoup(open("index.html"))
titleList = soup.findAll('title')
divList = soup.findAll('div', attrs={ "class" : "article story"})

As you can see, I find all <title> tags and then all <div> tags with class="article story".

Recursion
4

Happened to me also while trying to scrape Google.
I ended up using pyquery.
Install:

pip install pyquery

Use:

from pyquery import PyQuery
pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
tag = pq('div#articlebody')
Shoham
4

An id is supposed to be unique within a document, so you can search on it directly without even specifying the element type. It is a real advantage when the elements you want to parse have one.

divEle = soup.find(id = "articlebody")
Iqra.
2
from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = 'your_url'
session = HTMLSession()
resp = session.get(url)

# Only needed if the element with id "articlebody" is generated dynamically;
# otherwise there is no need to render
resp.html.render()

soup = BeautifulSoup(resp.html.html, "lxml")
soup.find("div", {"id": "articlebody"})
bot8080
-3
soup.find("tagName",attrs={ "id" : "articlebody" })
Zoe
  • 1
    provide more explanation to your answer – bhucho Oct 31 '20 at 14:46
  • 1
    Welcome to Stack Overflow. While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. [How to Answer](https://stackoverflow.com/help/how-to-answer) – Elletlar Oct 31 '20 at 16:52
  • bad answer: `TypeError: find() takes no keyword arguments` – loretoparisi Jan 26 '21 at 09:57