What's the easiest way to scrape just the text from a handful of webpages (using a list of URLs) using BeautifulSoup? Is it even possible?
Best, Georgina
import urllib2
import BeautifulSoup
import re

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urllib2.urlopen(url).read()
    # parse as html structured document
    bs = BeautifulSoup.BeautifulSoup(data, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # kill javascript content
    for s in bs.findAll('script'):
        s.replaceWith('')
    # find body and extract text
    txt = bs.find('body').getText('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
    txt = [getPageText(url) for url in urls]

if __name__=="__main__":
    main()
The code above now removes JavaScript and decodes HTML entities.
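To see what the Newlines pattern in the code above does on its own, here is a small sketch that needs only the re module. The sample string is made up and simply stands in for text pulled out of a page; it should run the same way under Python 2 or 3.

```python
import re

# Same pattern as in the answer: one line break followed by one or more
# whitespace characters (further line breaks count as whitespace too).
Newlines = re.compile(r'[\r\n]\s+')

# Made-up sample standing in for text extracted from a page.
sample = "Title\r\n\r\n   First paragraph\n\n\nSecond paragraph\nlast line"

# Each run of blank lines and leading indentation collapses to a single '\n'.
cleaned = Newlines.sub('\n', sample)
print(repr(cleaned))
# → 'Title\nFirst paragraph\nSecond paragraph\nlast line'
```

Note that a single line break directly followed by text is left alone, because the pattern requires at least one whitespace character after the break.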
It is perfectly possible. The easiest way is to iterate through the list of URLs, load each page's content, find the URLs in it, and add them to the main list. Stop iterating once enough pages have been found.
Just some tips:
urllib2.urlopen for fetching content
BeautifulSoup: findAll('a') for finding URLs
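The loop described above could be sketched roughly as follows. This is a Python 3 sketch under a few assumptions: urllib.request.urlopen takes the place of urllib2.urlopen, and the standard library's HTMLParser stands in for BeautifulSoup's findAll('a') so the example has no third-party dependency. The names LinkCollector, fetch, crawl, max_pages and the fake_site data are made up for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags (stand-in for bs.findAll('a'))."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)


def fetch(url):
    """Download a page; decoding as UTF-8 is an assumption of this sketch."""
    with urlopen(url) as response:
        return response.read().decode('utf-8', errors='replace')


def crawl(start_urls, max_pages, fetch_page=fetch):
    """Load pages breadth-first and stop once max_pages have been loaded."""
    queue = deque(start_urls)   # the "main list" of URLs still to visit
    seen = set(start_urls)      # avoids queueing the same URL twice
    pages = {}                  # url -> raw HTML
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:         # urllib's URLError/HTTPError derive from OSError
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)  # resolve relative links
            if link.startswith(('http://', 'https://')) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# Offline demo: a made-up site served from a dict instead of the network.
fake_site = {
    'http://example.com/': '<a href="/a">A</a> <a href="/b">B</a>',
    'http://example.com/a': '<a href="/">home</a> <a href="/c">C</a>',
    'http://example.com/b': 'no links here',
    'http://example.com/c': 'end',
}
pages = crawl(['http://example.com/'], max_pages=3,
              fetch_page=fake_site.__getitem__)
print(list(pages))
# → ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

Calling crawl(urls, max_pages=10) without the fetch_page argument would use the real network fetch. Each stored page could then be passed through a text extractor such as getPageText-style parsing to get just the text.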