-1

In Python 3, how would you extract the string between header tags, for example printing Hello, world! out of <h1>Hello, world!</h1>? Here is what I have so far:

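from urllib.request import urlopen

# Example URL that includes an <h1> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")

webPage = urlopen(userAddress)

# Read the whole response once and decode it to text.
# UTF-8 is assumed here; the page may declare a different charset.
pageText = webPage.read().decode("utf-8", errors="replace")

# Still missing: pulling the text between the header tags out of pageText.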
Remi Guan
Cameron

2 Answers

2

You need an HTML parser, for example BeautifulSoup:

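from bs4 import BeautifulSoup

# webPage is a freshly opened urlopen() response (or the HTML text read from it).
# Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
soup = BeautifulSoup(webPage, "html.parser")
print(soup.find("h1").get_text(strip=True))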

Demo:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://www.hobo-web.co.uk/headers/"
>>> webPage = urlopen(url)
>>>
>>> soup = BeautifulSoup(webPage, "html.parser")
>>> print(soup.find("h1").get_text(strip=True))
How To Use H1-H6 HTML Elements Properly

I'm not allowed to use any additional libraries, aside from what comes with python. Does python come with the ability to parse HTML, albeit in a less efficient way?

If you are, for some reason, not allowed to use third-party libraries, you can use the built-in html.parser module. Some people also use regular expressions to parse HTML. That is not always a bad thing, but you have to be very careful with it.
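
For the regular-expression route, here is a minimal standard-library sketch. It assumes the page has a simple, well-formed <h1 ...>text</h1> element; the names H1_PATTERN and first_h1_text are made up for illustration, and real-world markup (comments, scripts, unclosed tags) can easily defeat a pattern like this.

```python
import html
import re

# Assumption: the heading fits the simple <h1 ...>text</h1> shape.
# Non-greedy match so only the first heading's content is captured.
H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)</h1\s*>", re.IGNORECASE | re.DOTALL)


def first_h1_text(markup):
    """Return the text of the first <h1> element, or None if there is none."""
    match = H1_PATTERN.search(markup)
    if match is None:
        return None
    # Drop tags nested inside the heading, decode entities such as &amp;,
    # then collapse runs of whitespace into single spaces.
    inner = re.sub(r"<[^>]+>", "", match.group(1))
    return " ".join(html.unescape(inner).split())


print(first_h1_text("<h1>Hello, world!</h1>"))
# prints: Hello, world!
```

This is fine for a quick one-off against markup you control; for arbitrary pages a real parser is the safer choice.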

alecxe
  • I'm not allowed to use any additional libraries, aside from what comes with python. Does python come with the ability to parse HTML, albeit in a less efficient way? – Cameron Dec 13 '15 at 23:40
0

HTMLParser, from the standard library's html.parser module, is definitely your best friend for this problem.

There are related questions that already exist and cover your needs.
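
As a rough illustration of the HTMLParser approach, the sketch below collects the text of every h1 to h6 element using only the standard library. The class name HeadingParser and the helper extract_headings are hypothetical names chosen for this example, not part of any library.

```python
from html.parser import HTMLParser


class HeadingParser(HTMLParser):
    """Collect (tag, text) pairs for every <h1>..<h6> element (illustrative helper)."""

    HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()      # character references are decoded by default
        self.headings = []      # finished (tag, text) pairs
        self._current = None    # heading tag we are currently inside, if any
        self._chunks = []       # text pieces seen inside that heading

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if self._current is None and tag in self.HEADING_TAGS:
            self._current = tag
            self._chunks = []

    def handle_data(self, data):
        if self._current is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the pieces and collapse whitespace runs.
            text = " ".join("".join(self._chunks).split())
            self.headings.append((tag, text))
            self._current = None


def extract_headings(markup):
    """Parse an HTML string and return the headings found in it."""
    parser = HeadingParser()
    parser.feed(markup)
    parser.close()
    return parser.headings


print(extract_headings("<h1>Hello, world!</h1>"))
# prints: [('h1', 'Hello, world!')]
```

To use it on a fetched page, decode the bytes returned by urlopen(url).read() to a string first (UTF-8 is a common but not guaranteed choice) and pass that string to extract_headings.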

Andriy Ivaneyko