Beautiful Soup and extracting a div and its contents by ID

Question

soup.find("tagName", { "id" : "articlebody" })

Why does this NOT return the <div id="articlebody"> ... </div> tags and stuff in between? It returns nothing. And I know for a fact it exists because I'm staring right at it from

soup.prettify()

soup.find("div", { "id" : "articlebody" }) also does not work.

(EDIT: I found that BeautifulSoup wasn't correctly parsing my page, which probably meant the page I was trying to parse isn't properly formatted in SGML or whatever)

(To your EDIT, this question still has value as a reusable resource to others, even if the parser doesn't work on your particular page) — smci, Jun 07 '20 at 01:14

Lukáš Lalinský · Accepted Answer · 2010-01-25T23:02:11.190

298

You should post your example document, because the code works fine:

>>> import BeautifulSoup
>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div id="articlebody"> ... </div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>

Finding <div>s inside <div>s works as well:

>>> soup = BeautifulSoup.BeautifulSoup('<html><body><div><div id="articlebody"> ... </div></div></body></html>')
>>> soup.find("div", {"id": "articlebody"})
<div id="articlebody"> ... </div>
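
The sessions above use the Beautiful Soup 3 import style. A minimal sketch of the same lookup with Beautiful Soup 4, assuming the `beautifulsoup4` package is installed and using the built-in `html.parser`; the sample markup is made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative markup: the target div is nested inside another div.
html = '<html><body><div><div id="articlebody"> ... </div></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

# Dictionary form, as in the question.
by_attrs = soup.find("div", {"id": "articlebody"})
# Keyword form, equivalent for the id attribute.
by_keyword = soup.find("div", id="articlebody")

print(by_attrs)
print(by_attrs is by_keyword)
```

If this works but the same call fails on a real page, the markup or the parser is the more likely suspect than the `find()` syntax.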

edited Jan 25 '10 at 23:02

answered Jan 25 '10 at 22:55

Lukáš Lalinský

3

my example document is enormous. i'm tracking down the problem - i think this doesn't work on divs of divs. I did a count of how many divs are in the document with print len(soup('div')) which resulted in 10, and i can CLEARLY see more than 10 divs with firebug. so i think it just can't find divs inside divs, so i need to narrow things down wrapper by wrapper. – Tony Stark Jan 25 '10 at 22:59
10

Well, then it's impossible to answer your question, crystal balls are not a reliable way of debugging. :) – Lukáš Lalinský Jan 25 '10 at 23:00
1

I tried this code. the div has and I can't print the embed inside it. – Vincent Dec 03 '13 at 08:06
26

or more simply [`div = soup.find(id="articlebody")`](http://stackoverflow.com/a/22410466/4279) – jfs May 05 '14 at 17:39
5

or `soup.find('div', id='articlebody')` – Trevor Boyd Smith Nov 11 '16 at 21:19
1

When installing, use `pip install beautifulsoup4`, because the command without the 4 installs version 3 instead of 4. https://beautiful-soup-4.readthedocs.io/en/latest/#installing-beautiful-soup – Jan Sila Jul 17 '17 at 12:49

score 125 · Answer 2 · answered Mar 14 '14 at 16:17

To find an element by its id:

div = soup.find(id="articlebody")

answered Mar 14 '14 at 16:17

jfs

1

Maybe this is for an old version? `Exception: TypeError: find() takes no keyword arguments` – boatcoder Apr 15 '22 at 20:25
1

@boatcoder it works with the latest version (4.11.1) – jfs Apr 16 '22 at 21:46
https://replit.com/@zed1/bs4-find – jfs Apr 17 '22 at 11:44
You get `TypeError: find() takes no keyword arguments` when you call `find()` on a string, not a parsed BS object. – tetafro Mar 11 '23 at 20:02
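
tetafro's point can be reproduced offline. In this sketch (the variable names are illustrative), `find()` is first called on the raw HTML string, where it resolves to the built-in `str.find`, and then on a parsed `BeautifulSoup` object:

```python
from bs4 import BeautifulSoup

html = '<div id="articlebody">text</div>'

# Calling find() on the raw string invokes str.find, which rejects keywords.
try:
    html.find(id="articlebody")
    raised = False
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# Parsing first gives an object whose find() accepts attribute keywords.
soup = BeautifulSoup(html, "html.parser")
div = soup.find(id="articlebody")
print(div.get_text())
```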

Josh Crozier · Answer 3 · 2017-02-20T05:50:27.807

Beautiful Soup 4 supports most CSS selectors with the .select() method, therefore you can use an id selector such as:

soup.select('#articlebody')

If you need to specify the element's type, you can add a type selector before the id selector:

soup.select('div#articlebody')

The .select() method will return a collection of elements, which means that it would return the same results as the following .find_all() method example:

soup.find_all('div', id="articlebody")
# or
soup.find_all(id="articlebody")

If you only want to select a single element, then you could just use the .find() method:

soup.find('div', id="articlebody")
# or
soup.find(id="articlebody")
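
To make the difference in return types concrete, here is a small sketch with invented sample markup. `select()` and `find_all()` return collections, while `find()` and `select_one()` return a single element or `None`:

```python
from bs4 import BeautifulSoup

html = '<body><div id="articlebody"><p>Hello</p></div><div id="other"></div></body>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("div#articlebody")       # collection of matching tags
same = soup.find_all("div", id="articlebody")  # equivalent collection
first = soup.select_one("#articlebody")        # single tag, or None
missing = soup.find(id="no-such-id")           # None when nothing matches

print(len(matches), len(same))
print(first.p.get_text())
print(missing)
```

Because `find()` and `select_one()` can return `None`, it is worth checking the result before accessing attributes on it.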

score 17 · Answer 4 · answered Mar 04 '10 at 03:34

I think there is a problem when the 'div' tags are too deeply nested. I am trying to parse some contacts from a Facebook HTML file, and BeautifulSoup is not able to find "div" tags with class "fcontent".

This happens with other classes as well. When I search for divs in general, it returns only those that are not so deeply nested.

The HTML source can be any Facebook page showing the friends list of a friend of yours (not your own friends list). If someone can test it and give some advice I would really appreciate it.

This is my code, where I just try to print the number of tags "div" with class "fcontent":

from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2)

f = open('/Users/myUserName/Desktop/contacts.html')
soup = BeautifulSoup(f)
# Count the <div> tags whose class is "fcontent"
divs = soup.findAll('div', attrs={'class': 'fcontent'})
print len(divs)

Could be a bug. Log an issue perhaps? – progonkpa Feb 01 '22 at 02:07

score 9 · Answer 5 · answered Jan 29 '13 at 16:20

This is most likely a problem with the default Beautiful Soup parser. Switch to a different parser, such as 'lxml', and try again.
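
One way to check whether the parser is the culprit is to parse the same markup with each available parser and compare the results. This is only a sketch: the sample markup and the `count_divs` helper are hypothetical, and `lxml` and `html5lib` are optional third-party packages that are skipped when they are not installed. In Beautiful Soup 4 the parser name is passed as the second argument.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately sloppy markup: unclosed <p>, missing closing tags.
html = '<div><p>intro<div id="articlebody">text</div>'


def count_divs(markup):
    """Return {parser name: number of <div> tags found}, skipping missing parsers."""
    counts = {}
    for parser in ("html.parser", "lxml", "html5lib"):
        try:
            soup = BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed
        counts[parser] = len(soup.find_all("div"))
    return counts


print(count_divs(html))
```

If the counts differ between parsers on your own page, the markup is probably malformed in a way that each parser repairs differently.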

answered Jan 29 '13 at 16:20

liang

This worked for me, thanks! I used `soup = BeautifulSoup(data, parser="html.parser")` – will-hart Jun 10 '14 at 22:11

dagoof · Answer 6 · 2010-01-25T23:14:20.227

8

In the beautifulsoup source this line allows divs to be nested within divs; so your concern in lukas' comment wouldn't be valid.

NESTABLE_BLOCK_TAGS = ['blockquote', 'div', 'fieldset', 'ins', 'del']

What I think you need to do is to specify the attrs you want such as

source.find('div', attrs={'id':'articlebody'})

edited Jan 25 '10 at 23:14

answered Jan 25 '10 at 23:05

dagoof

score 5 · Answer 7 · answered Jan 25 '10 at 23:00

have you tried soup.findAll("div", {"id": "articlebody"})?

sounds crazy, but if you're scraping stuff from the wild, you can't rule out multiple divs...
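
A quick way to test that suspicion is to ask for every element with the id rather than just the first. A sketch with invented markup containing a duplicated id, written for Beautiful Soup 4, where `findAll` is spelled `find_all`:

```python
from bs4 import BeautifulSoup

# Invalid but seen in the wild: the same id used twice.
html = '<div id="articlebody">first</div><div id="articlebody">second</div>'
soup = BeautifulSoup(html, "html.parser")

all_matches = soup.find_all("div", {"id": "articlebody"})
first_only = soup.find("div", {"id": "articlebody"})

print(len(all_matches))
print(first_only.get_text())
```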

answered Jan 25 '10 at 23:00

user106514

score 5 · Answer 8 · answered Jan 25 '10 at 23:02

I used:

soup.findAll('tag', attrs={'attrname':"attrvalue"})

As my syntax for find/findall; that said, unless there are other optional parameters between the tag and attribute list, this shouldn't be different.

score 4 · Answer 9 · answered Jan 25 '10 at 23:03

Here is a code fragment:

soup = BeautifulSoup(open("index.html"))
titleList = soup.findAll('title')
divList = soup.findAll('div', attrs={ "class" : "article story"})

As you can see, I find all title tags and then all div tags with class "article story".

answered Jan 25 '10 at 23:03

Recursion

score 4 · Answer 10 · answered Apr 30 '15 at 05:34

Happened to me also while trying to scrape Google.
I ended up using pyquery.
Install:

pip install pyquery

Use:

from pyquery import PyQuery
pq = PyQuery('<html><body><div id="articlebody"> ... </div></body></html>')
tag = pq('div#articlebody')

answered Apr 30 '15 at 05:34

Shoham

score 4 · Answer 11 · answered May 11 '20 at 10:40

An id is meant to be unique within a document, so you can search on it directly without specifying the element type. If your elements have ids, that makes them easy to pick out when parsing the content.

divEle = soup.find(id = "articlebody")

answered May 11 '20 at 10:40

Iqra.

score 2 · Answer 12 · answered Aug 23 '20 at 06:34

from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = 'your_url'
session = HTMLSession()
resp = session.get(url)

# Render only if the element with id "articlebody" is generated dynamically (by JavaScript); otherwise this step can be skipped
resp.html.render()

soup = BeautifulSoup(resp.html.html, "lxml")
soup.find("div", {"id": "articlebody"})

score -3 · Answer 13 · edited Oct 31 '20 at 11:09

soup.find("tagName",attrs={ "id" : "articlebody" })

edited Oct 31 '20 at 11:09

Zoe

answered Oct 31 '20 at 11:03

Shri narayan

1

provide more explanation to your answer – bhucho Oct 31 '20 at 14:46
1

Welcome to Stack Overflow. While this code may answer the question, providing additional context regarding why and/or how this code answers the question improves its long-term value. [How to Answer](https://stackoverflow.com/help/how-to-answer) – Elletlar Oct 31 '20 at 16:52
bad answer: `TypeError: find() takes no keyword arguments` – loretoparisi Jan 26 '21 at 09:57
