I am trying to extract the h1 header (or any other header) from an HTML file.

My Python code is below:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.le.ac.uk/oerresources/bdra/html/page_09.htm')
# print(html.read())

# using beautifulsoup
bs = BeautifulSoup(html, 'html.parser')
h2 = bs.find('h2', {'id': 'toc'})
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))
print(h2)

As you can see from the snippet above, I have tried to extract all the headers, but all I get is an empty list and None. I have checked the HTML file and verified that the headers are present. I have also tried matching on the class attribute instead: h2 = bs.find('h2', {'class' : 'toc'})

Can somebody tell me what I'm doing wrong here?

Midhun

1 Answer


I get the following output when I run the code:

[<h1>Introduction to HTML/XHTML</h1>, <h2><a href="index.htm" id="toc-title">Table of Contents</a></h2>, <h2>Example HTML Document</h2>]

Code I used:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('https://www.le.ac.uk/oerresources/bdra/html/page_09.htm').read().decode("utf-8")
# using beautifulsoup
bs = BeautifulSoup(html, 'html.parser')
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))

urlopen gives you an http.client.HTTPResponse object. You need to read it and then decode the resulting bytes as UTF-8.
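
To illustrate the read-then-decode step without touching the network, here is a minimal sketch. It uses io.BytesIO as a stand-in for the response (an assumption made only so the example runs offline; the real object is http.client.HTTPResponse) and a one-line HTML fragment copied from the output above. It also shows that such a stream can be read only once, so an earlier read() call, like the commented-out print(html.read()) in the question if it were active, would leave nothing for the parser.

```python
import io

# Stand-in for the object returned by urlopen(): a binary stream.
# io.BytesIO is an assumption used here so the sketch runs offline.
response = io.BytesIO(b"<h1>Introduction to HTML/XHTML</h1>")

raw = response.read()        # bytes, exactly as they came over the wire
html = raw.decode("utf-8")   # str, ready for BeautifulSoup(html, 'html.parser')
leftover = response.read()   # the stream is already consumed, so this is empty

print(type(raw).__name__, type(html).__name__, leftover)
```

With a real response the same two calls apply: read() first, then decode("utf-8") on the result.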

This question is probably a duplicate of: BeautifulSoup HTTPResponse has no attribute encode
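
As a side note on the None result for bs.find('h2', {'id' : 'toc'}): in the output above, the id attribute sits on the <a> element inside the h2, and its value is toc-title rather than toc. The sketch below is only an illustration under that reading. It parses a fragment copied from that output, not the live page, and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Fragment copied from the printed output above, not fetched from the site.
html = (
    '<h1>Introduction to HTML/XHTML</h1>'
    '<h2><a href="index.htm" id="toc-title">Table of Contents</a></h2>'
    '<h2>Example HTML Document</h2>'
)
bs = BeautifulSoup(html, "html.parser")

# No h2 in this fragment carries id="toc", so this lookup finds nothing.
by_h2_id = bs.find("h2", {"id": "toc"})

# The id is on the link, so find the link and walk up to its h2 parent.
link = bs.find("a", {"id": "toc-title"})
toc_heading = link.find_parent("h2")

print(by_h2_id)
print(toc_heading.get_text())
```

If the real page differs from the fragment, the selector would need adjusting accordingly.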

Ashish Cherian