python list.append between text

Question

In Python 3, how would you go about taking the string between header tags, for example, printing Hello, world!, out of <h1>Hello, world!</h1>:

import urllib
from urllib.request import urlopen

#example URL that includes an <h> tag: http://www.hobo-web.co.uk/headers/
userAddress = input("Enter a website URL: ")

webPage = urllib.request.urlopen(userAddress)

list = []

while webPage != "":
    webPage.read()
    list.append()

score 2 · Answer 1 · edited May 23 '17 at 12:23

You need an HTML Parser. For example, BeautifulSoup:

from bs4 import BeautifulSoup

soup = BeautifulSoup(webPage, "html.parser")  # name the parser explicitly so bs4 does not have to guess one
print(soup.find("h1").get_text(strip=True))

Demo:

>>> from urllib.request import urlopen
>>> from bs4 import BeautifulSoup
>>>
>>> url = "http://www.hobo-web.co.uk/headers/"
>>> webPage = urlopen(url)
>>>
>>> soup = BeautifulSoup(webPage, "html.parser")
>>> print(soup.find("h1").get_text(strip=True))
How To Use H1-H6 HTML Elements Properly

If, for some reason, you are not allowed to use third-party libraries, you can use the built-in html.parser module. Some people also use regular expressions to parse HTML. That is not always a bad thing, but you have to be very careful with it; see:

RegEx match open tags except XHTML self-contained tags
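
For small, well-formed pages, the regular-expression route mentioned above might look roughly like the sketch below. It assumes the page has already been decoded to a str; the sample markup and the names html_text, H1_PATTERN and headers are made up for illustration. It will not cope with nested or malformed <h1> tags, which is exactly why the linked question urges caution.

```python
import re

# Assumption: html_text stands in for the decoded page source (hypothetical sample).
html_text = "<html><body><h1>Hello, world!</h1><p>text</p><h1 class='x'>Second</h1></body></html>"

# Non-greedy capture of everything between an opening <h1 ...> and the next </h1>.
# DOTALL lets the header text span several lines; IGNORECASE also matches <H1>.
H1_PATTERN = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)

# findall returns only the captured group, i.e. the text between the tags.
headers = [text.strip() for text in H1_PATTERN.findall(html_text)]
print(headers)  # ['Hello, world!', 'Second']
```

Any inline markup inside the header (for example <em>) would be returned as-is, so this is only reasonable when the header content is known to be plain text.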

I'm not allowed to use any additional libraries, aside from what comes with python. Does python come with the ability to parse HTML, albeit in a less efficient way? — Cameron, Dec 13 '15 at 23:40

score 0 · Answer 2 · edited May 23 '17 at 12:15

HTMLParser is definitely your best friend for this problem.

Related questions already exist that cover your needs.
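
A minimal sketch of the HTMLParser approach, using only the standard library. The class name H1Collector and the sample markup are hypothetical; the idea is to set a flag while inside an <h1> element and append the text seen there to a list.

```python
from html.parser import HTMLParser


class H1Collector(HTMLParser):
    """Collects the text inside every <h1> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False   # True while the parser is between <h1> and </h1>
        self.headers = []    # one string per <h1> element found

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headers.append("")  # start a new, empty header entry

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Text inside nested tags such as <em> still arrives here,
        # so it is appended to the current header.
        if self.in_h1:
            self.headers[-1] += data


parser = H1Collector()
parser.feed("<h1>Hello, world!</h1><p>ignored</p><h1>Second <em>header</em></h1>")
parser.close()

headers = [text.strip() for text in parser.headers]
print(headers)  # ['Hello, world!', 'Second header']
```

To run this against the page from the question, the bytes returned by webPage.read() would need decoding first, for example webPage.read().decode("utf-8", errors="replace"), assuming the page is UTF-8 encoded.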

answered Dec 13 '15 at 23:33 by Andriy Ivaneyko
