How to extract h1 tag from an HTML file with BeautifulSoup?

Question

I am trying to extract the h1 (or any other header) tag from an HTML file.

My Python code is below:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.le.ac.uk/oerresources/bdra/html/page_09.htm');
# print(html.read());

# using beautifulsoup
bs = BeautifulSoup(html, 'html.parser');
h2 = bs.find('h2', {'id' : 'toc'});
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]));
print(h2);

As you can see from the snippet above, I have tried to extract all the headers, but all I get is an empty list and None. I have checked the HTML file and verified that the headers are present. I have also tried passing a class dictionary instead, like h2 = bs.find('h2', {'class' : 'toc'});

Can somebody tell me what I'm doing wrong here?

Which version are you using? It works for me with `beautifulsoup4==4.7.1`: I can find `h1` and `h2`. There isn't any h2 with id=toc. — Andrej Kesely, Jun 30 '19 at 06:03
The latest version? Type `pip freeze`, it will show the version. — Andrej Kesely, Jun 30 '19 at 06:06
That's an old version. Update it to `4.7.1`, because the code works for me (I'm on Python 3.6.8). — Andrej Kesely, Jun 30 '19 at 06:08
@AndrejKesely I have updated both bs and Python, to 4.7.1 and 3.7 respectively. I still didn't get the correct output. — Midhun, Jun 30 '19 at 06:19
That's strange, because when I run the exact code you posted here I get `[<h1>Introduction to HTML/XHTML</h1>, <h2><a href="index.htm" id="toc-title">Table of Contents</a></h2>, <h2>Example HTML Document</h2>]` and `None`. Try using `lxml` instead of `html.parser`. Try to `print(bs.prettify())` to see what is inside the soup. — Andrej Kesely, Jun 30 '19 at 06:23
Strangely, it works now. Somehow it didn't the first time I ran it. Thanks for the help. — Midhun, Jun 30 '19 at 06:55

score 1 · Answer 1 · answered Jun 30 '19 at 06:07

I get the following output when I run the code:

[<h1>Introduction to HTML/XHTML</h1>, <h2><a href="index.htm" id="toc-title">Table of Contents</a></h2>, <h2>Example HTML Document</h2>]

Code I used:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.le.ac.uk/oerresources/bdra/html/page_09.htm').read().decode("utf-8")
# using beautifulsoup
bs = BeautifulSoup(html, 'html.parser')
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))

urlopen gives you an http.client.HTTPResponse object; you need to read it and then decode the resulting bytes as UTF-8 to get a string.
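
As a minimal sketch of the read-then-decode step, the example below parses an in-memory byte string instead of a live response, so it runs without network access. The sample markup is modeled on the output shown above; `SAMPLE_PAGE`, `HEADER_TAGS` and `extract_headers` are hypothetical names introduced here for illustration, and the UTF-8 encoding is an assumption about the page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the bytes returned by urlopen(...).read().
SAMPLE_PAGE = (
    b"<html><body>"
    b"<h1>Introduction to HTML/XHTML</h1>"
    b'<h2><a href="index.htm" id="toc-title">Table of Contents</a></h2>'
    b"<p>Some body text.</p>"
    b"<h2>Example HTML Document</h2>"
    b"</body></html>"
)

HEADER_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def extract_headers(raw_bytes, encoding="utf-8"):
    """Decode raw page bytes and return (tag name, text) pairs for every header."""
    html = raw_bytes.decode(encoding)  # assumption: the page is UTF-8 encoded
    soup = BeautifulSoup(html, "html.parser")
    # find_all with a list matches any of the listed tag names, in document order.
    return [(tag.name, tag.get_text(strip=True)) for tag in soup.find_all(HEADER_TAGS)]


headers = extract_headers(SAMPLE_PAGE)
for name, text in headers:
    print(name, text)
```

With a live page, the same helper would take the result of `urlopen(url).read()` as its argument.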

This question is probably a duplicate of: BeautifulSoup HTTPResponse has no attribute encode
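
On the `None` returned by `bs.find('h2', {'id' : 'toc'})`: going by the output above, the only id among the headers is `toc-title`, and it sits on the `<a>` inside the `<h2>`, not on the `<h2>` itself. Here is a small sketch of that lookup, using a hypothetical `SNIPPET` string copied from that output rather than the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet, copied from the header output shown in this answer.
SNIPPET = '<h2><a href="index.htm" id="toc-title">Table of Contents</a></h2>'

soup = BeautifulSoup(SNIPPET, "html.parser")

# No <h2> carries id="toc", so this lookup finds nothing.
missing = soup.find("h2", {"id": "toc"})

# The id is on the link, so search by id and step up to the enclosing header.
link = soup.find(id="toc-title")
header = link.find_parent("h2")

print(missing)
print(header.get_text())
```

Searching by the id that is actually present, then walking up with `find_parent`, is one way to reach the header; selecting the `<h2>` first and inspecting its children would work as well.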
