2

How can one extract the content you see on a webpage into a string? For example, turning this:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>myWebpage</title>
</head>
<body>
    <p>this</p>
    <p>is</p>
    <p>an</p>
    <p>example</p>
</body>
</html>

into a string that looks like this:

this is an example

2 Answers

1

html2text does what you want; it converts HTML into Markdown-formatted plain text: https://github.com/Alir3z4/html2text

You can also try something like:

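from urllib.request import urlopen  # Python 3 location of urlopen
from bs4 import BeautifulSoup

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
html = urlopen(url).read()
# nltk.clean_html() was removed in NLTK 3 and now raises NotImplementedError;
# its error message recommends BeautifulSoup's get_text() instead.
raw = BeautifulSoup(html, "html.parser").get_text()
print(raw)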

This, for example, extracts the text from the BBC page at the URL above.
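
If installing a third-party package is not an option, the same idea can be sketched with only the standard library's html.parser module. This is a minimal sketch, not a full renderer: the names VisibleTextExtractor and html_to_text are made up for illustration, and the set of skipped tags (head, title, script, style) is an assumption about what counts as "not visible".

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, ignoring text inside elements that are not displayed."""

    # Assumption: text inside these elements is not part of the visible page.
    SKIPPED_TAGS = {"head", "title", "script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text found outside skipped elements.
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = """<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>myWebpage</title>
</head>
<body>
    <p>this</p>
    <p>is</p>
    <p>an</p>
    <p>example</p>
</body>
</html>"""

print(html_to_text(page))  # this is an example
```

For a live page, the string passed to html_to_text would be the decoded response body, for example urlopen(url).read().decode("utf-8"), assuming the page is UTF-8 encoded. Text that JavaScript adds after the page loads will not appear, since this only parses the HTML as delivered.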

0

You can use Selenium; the documentation is here: https://pypi.org/project/selenium/