I'm trying to extract the h1 (or any other) header from an HTML file.
My Python code is below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.le.ac.uk/oerresources/bdra/html/page_09.htm')
# print(html.read())
# using BeautifulSoup
bs = BeautifulSoup(html, 'html.parser')
h2 = bs.find('h2', {'id': 'toc'})
print(bs.find_all(["h1", "h2", "h3", "h4", "h5", "h6"]))
print(h2)
As you can see from the snippet above, I have tried to extract all the headers, but all I get is an empty list and `None`. I have checked the HTML
file and verified that the headers are present. I have also tried matching on the class instead, with `h2 = bs.find('h2', {'class' : 'toc'})`.
Can somebody tell me what I am doing wrong here?
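One thing that may be worth ruling out, assuming the commented-out `print(html.read())` line was active when the empty result appeared: a response object can only be read once, so a second read hands BeautifulSoup an empty document. The sketch below imitates that with `io.BytesIO` as a stand-in for the response; `PAGE_BYTES` and `find_headers` are hypothetical names, and the markup is made up for illustration rather than copied from the real page.

```python
import io

from bs4 import BeautifulSoup

HEADER_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]

# Hypothetical page content standing in for the downloaded file.
PAGE_BYTES = (
    b"<html><body>"
    b"<h1>Introduction to HTML/XHTML</h1>"
    b"<h2>Table of Contents</h2>"
    b"</body></html>"
)


def find_headers(stream):
    """Parse a file-like object and return every h1-h6 tag found in it."""
    soup = BeautifulSoup(stream, "html.parser")
    return soup.find_all(HEADER_TAGS)


# Case 1: the stream is passed to BeautifulSoup untouched.
fresh = io.BytesIO(PAGE_BYTES)
fresh_headers = find_headers(fresh)

# Case 2: the stream was already read once (as print(html.read()) would do),
# so nothing is left for BeautifulSoup to parse.
consumed = io.BytesIO(PAGE_BYTES)
consumed.read()
consumed_headers = find_headers(consumed)

print(len(fresh_headers))
print(consumed_headers)
```

If this turns out to be the cause, reading the body once into a variable and passing that variable to both `print` and `BeautifulSoup` avoids the problem.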
Running your code, I get a list of three headers (Introduction to HTML/XHTML, Table of Contents, Example HTML Document) and `None`. Try using `lxml` instead of `html.parser`. Try to `print(bs.prettify())` to see what is inside the soup. – Andrej Kesely Jun 30 '19 at 06:23
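The two checks suggested in the comment can be sketched roughly as follows. This is only an illustration: `SAMPLE_HTML` is a hypothetical stand-in built from the header texts quoted above (the real page may differ), `make_soup` and `header_texts` are made-up helper names, and the fallback to `html.parser` assumes `lxml` may not be installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

HEADER_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]

# Hypothetical markup; not copied from the actual page.
SAMPLE_HTML = """
<html><body>
<h1>Introduction to HTML/XHTML</h1>
<h2>Table of Contents</h2>
<h2>Example HTML Document</h2>
</body></html>
"""


def make_soup(markup, parser="lxml"):
    """Parse with the requested parser, falling back to html.parser."""
    try:
        return BeautifulSoup(markup, parser)
    except FeatureNotFound:
        # lxml is an optional third-party package and may be missing.
        return BeautifulSoup(markup, "html.parser")


def header_texts(soup):
    """Return the stripped text of every h1-h6 tag in document order."""
    return [tag.get_text(strip=True) for tag in soup.find_all(HEADER_TAGS)]


soup = make_soup(SAMPLE_HTML)
print(soup.prettify())                    # inspect what the parser actually saw
print(header_texts(soup))
print(soup.find("h2", {"id": "toc"}))     # None here: no h2 carries id="toc"
```

In this sample the `find` call returns `None` simply because no `h2` has an `id` of `toc`, which would be consistent with getting the headers from `find_all` but `None` from the `id` lookup.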