How to get the text present on a webpage using Python?

Question

import urllib3
from bs4 import BeautifulSoup
url = 'http://www.thefamouspeople.com/singers.php'
http = urllib3.PoolManager()
response = http.request('GET', url)
print(response.data)

I am using Python 3.5.2. I am unable to install urllib or urllib2 to use the urlopen function; I get the error "No suitable versions found".

The output of the code above is the page's HTML source, the same thing we see when we do "Inspect source code". I want only the text of the page, like this:

The last natural blondes will die out within 200 years, scientists believe.
A study by experts in Germany suggests people with blonde hair are an
endangered species and will become extinct by 2202.

Researchers predict the last truly natural blonde will be born in Finland - 
the country with the highest proportion of blondes.


The frequency of blondes may drop but they won't disappear

Prof Jonathan Rees, University of Edinburgh
But they say too few people now carry the gene for blondes to last beyond 
the next two centuries.

The problem is that blonde hair is caused by a recessive gene.

In order for a child to have blonde hair, it must have the gene on both 
sides of the family in the grandparents' generation.

Dyed rivals

The researchers also believe that so-called bottle blondes may be to blame 
for the demise of their natural rivals.

They suggest that dyed-blondes are more attractive to men who choose them as 
partners over true blondes.

Tory MP Ann Widdecombe
Bottle-blondes like Ann Widdecombe may be to blame
But Jonathan Rees, professor of dermatology at the University of Edinburgh 
said it was unlikely blondes would die out completely.

"Genes don't die out unless there is a disadvantage of having that gene or 
by chance. They don't disappear," he told BBC News Online.

"The only reason blondes would disappear is if having the gene was a 
disadvantage and I do not think that is the case.

"The frequency of blondes may drop but they won't disappear."


See also:

28 Mar 01 | Education
What is it about blondes?
09 Apr 99 | Health
Platinum blondes are labelled as dumb
17 Apr 02 | Health
Hair dye cancer alert
Internet links:

University of Edinburgh

The BBC is not responsible for the content of external internet sites
Top Health stories now:

Heart risk link to big families
Back pain drug 'may aid diabetics'
Congo Ebola outbreak confirmed
Vegetables ward off Alzheimer's
Polio campaign launched in Iraq
Gene defect explains high blood pressure
Botox 'may cause new wrinkles'
Alien 'abductees' show real symptoms

Links to more Health stories are at the foot of the page.

This is the content present on the website http://www.thefamouspeople.com/singers.php. I need help getting it.

I recommend the Scrapy framework. It is very easy to use, flexible, and saves you a lot of time. — Praind, Nov 27 '17 at 12:53
[BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) could be what you need to extract text from an HTML page. — Serge Ballesta, Nov 27 '17 at 12:58
import urllib3
from bs4 import BeautifulSoup
url = 'http://www.thefamouspeople.com/singers.php'
http = urllib3.PoolManager()
response = http.request('GET', url)
# Pass the response body (response.data), not the response object itself
soup = BeautifulSoup(response.data, 'html.parser')
print(soup.get_text().strip())
— vinayak, Nov 27 '17 at 13:06
Possible duplicate of [Extracting text from HTML file using Python](https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python) — tripleee, Nov 27 '17 at 13:21
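
The BeautifulSoup suggestions in the comments above can be sketched roughly as follows. This is only a minimal illustration, not a definitive solution: the helper name `html_to_text` is made up for this example, and the choice to drop `script`, `style` and `noscript` elements is an assumption about what counts as "text present on the page".

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Return the readable text of an HTML document given as str or bytes.

    Hypothetical helper written for this example; it is not part of bs4.
    """
    # "html.parser" is the parser bundled with Python, so nothing extra to install.
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are never displayed to the reader.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    # Put each text fragment on its own line, then drop whitespace-only lines.
    lines = (line.strip() for line in soup.get_text(separator="\n").splitlines())
    return "\n".join(line for line in lines if line)
```

With the code from the question, this would be called as `print(html_to_text(response.data))`. Note that text from inline tags such as `<b>` ends up on its own line with this separator, and navigation links, footers and similar page furniture are still included; narrowing the result to the main article usually requires selecting a specific element of the page first.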

score 0 · Answer 1 · answered Nov 30 '17 at 10:18

I know this is not what you are asking, but why not use something that already works? There are many online services that extract text from HTML pages. Here are some examples:

https://contentxtractor.com/

http://www.webcontentextractor.com/

https://scrapy.org/

More here: https://www.quora.com/What-are-some-good-free-web-scrapers-scraping-techniques
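
On the installation problem mentioned in the question: in Python 3, `urllib` is part of the standard library (`urlopen` lives in `urllib.request`), and `urllib2` exists only in Python 2, so neither needs to be installed. Below is a hedged sketch that uses nothing outside the standard library. The names `TextExtractor`, `extract_text` and `fetch_text` are invented for this example, and skipping only `script` and `style` is an assumption.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the text of an HTML string, one fragment per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_text(url):
    """Download a page and return its text (requires network access)."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return extract_text(html)
```

For the page in the question, the call would be `print(fetch_text('http://www.thefamouspeople.com/singers.php'))`. Some sites reject requests that lack a browser-like `User-Agent` header, and pages that build their content with JavaScript will not yield that content this way.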
