What is the best way to extract the content of an HTML file into a String? (in Python)

Question

How can I extract the content you see on a web page into a string? For example, turning this:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>myWebpage</title>
</head>
<body>
    <p>this</p>
    <p>is</p>
    <p>an</p>
    <p>example</p>
</body>
</html>

Into a string that looks like this:

this is an example

Use [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) package. — Corralien, Jun 01 '21 at 21:00
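
A minimal sketch of what that suggestion could look like, assuming the `beautifulsoup4` package is installed and using a shortened copy of the HTML from the question (the variable names are just for illustration):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Shortened copy of the page from the question
html = """<!DOCTYPE html>
<html lang="en">
<head>
    <title>myWebpage</title>
</head>
<body>
    <p>this</p>
    <p>is</p>
    <p>an</p>
    <p>example</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")
# Restrict to <body> so the <title> text is left out, join the text
# strings with single spaces, and strip whitespace around each of them.
text = soup.body.get_text(separator=" ", strip=True)
print(text)  # this is an example
```

Calling `get_text()` on `soup` instead of `soup.body` would also pick up the `<title>` text, which is probably not wanted here.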

score 1 · Answer 1 · answered Jun 01 '21 at 21:02

This program converts HTML to plain (Markdown-formatted) text, which does what you want: https://github.com/Alir3z4/html2text

You can also try something like:

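# Python 3: urlopen lives in urllib.request, and nltk.clean_html() was
# removed in NLTK 3 (it now raises NotImplementedError), so this uses
# BeautifulSoup's get_text() instead.
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# Join the page's text strings with spaces, stripping each one
raw = BeautifulSoup(html, "html.parser").get_text(" ", strip=True)
print(raw)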

This, for example, extracts the text from the web page at that URL.
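
If installing a third-party package is not an option, something similar can be sketched with only the standard library's `html.parser`. This is a rough illustration and not a full HTML-to-text converter; the `TextExtractor` class and `html_to_text` helper are made-up names, and skipping `<head>`, `<script>` and `<style>` is an assumption about what counts as "visible" content:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <head>, <script>, <style>."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace and ignore whitespace-only nodes
        text = " ".join(data.split())
        if self.skip_depth == 0 and text:
            self.parts.append(text)


def html_to_text(html):
    """Return the collected text nodes joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


sample = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>myWebpage</title>
</head>
<body>
    <p>this</p>
    <p>is</p>
    <p>an</p>
    <p>example</p>
</body>
</html>"""

print(html_to_text(sample))  # this is an example
```

For messy real-world pages a dedicated library will generally cope better than this.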

score 0 · Answer 2 · answered Jun 01 '21 at 21:08

You can use Selenium, which drives a real browser and so also works for pages whose content is rendered by JavaScript. The documentation is here: https://pypi.org/project/selenium/

Roberto Cherchi
