Python - Easiest way to scrape text from list of URLs using BeautifulSoup

Question

What's the easiest way to scrape just the text from a handful of webpages (using a list of URLs) using BeautifulSoup? Is it even possible?

Best, Georgina

Hugh Bothwell · Accepted Answer · 2011-03-16T21:13:20.453

import urllib2
import BeautifulSoup
import re

Newlines = re.compile(r'[\r\n]\s+')

def getPageText(url):
    # given a url, get page content
    data = urllib2.urlopen(url).read()
    # parse as html structured document
    bs = BeautifulSoup.BeautifulSoup(data, convertEntities=BeautifulSoup.BeautifulSoup.HTML_ENTITIES)
    # kill javascript content
    for s in bs.findAll('script'):
        s.replaceWith('')
    # find body and extract text
    txt = bs.find('body').getText('\n')
    # remove multiple linebreaks and whitespace
    return Newlines.sub('\n', txt)

def main():
    urls = [
        'http://www.stackoverflow.com/questions/5331266/python-easiest-way-to-scrape-text-from-list-of-urls-using-beautifulsoup',
        'http://stackoverflow.com/questions/5330248/how-to-rewrite-a-recursive-function-to-use-a-loop-instead'
    ]
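    txt = [getPageText(url) for url in urls]
    # print the text of each page (getText returns unicode, so encode it for output)
    for page in txt:
        print(page.encode('utf-8'))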

if __name__=="__main__":
    main()

It now removes JavaScript and decodes HTML entities.
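
The code above targets Python 2 and BeautifulSoup 3. For readers on Python 3, a minimal sketch of the same steps might look like the following, assuming the `bs4` package (BeautifulSoup 4) is installed. The names `extract_text` and `get_page_text` are illustrative and not from the original answer. bs4 decodes HTML entities on its own, so there is no `convertEntities` argument, and this sketch also drops `style` elements, which the original does not.

```python
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

# a line break followed by any run of whitespace (same pattern as the answer)
NEWLINES = re.compile(r'[\r\n]\s+')


def extract_text(html):
    """Return the body text of an HTML document (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    # drop script and style elements so their source is not returned as text
    for tag in soup.find_all(['script', 'style']):
        tag.decompose()
    # fall back to the whole document if there is no <body> element
    root = soup.body if soup.body is not None else soup
    text = root.get_text('\n')
    # collapse line breaks that are followed by more whitespace
    return NEWLINES.sub('\n', text).strip()


def get_page_text(url):
    """Fetch a URL and return its text (needs network access)."""
    with urlopen(url) as response:
        return extract_text(response.read())
```

Calling `[get_page_text(url) for url in urls]` would then give one string per page, as in `main()` above.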

score 1 · Answer 2 · answered Mar 16 '11 at 20:35

It is perfectly possible. The easiest way is to iterate through the list of URLs, load the content of each page, find the URLs in it, and add them to the main list. Stop iterating when enough pages have been found.

Just some tips:

urllib2.urlopen for fetching content
BeautifulSoup: findAll('a') for finding URLs
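
The loop described in this answer could be sketched as follows. This is only an illustration under several assumptions: it uses Python 3 and the `bs4` package rather than `urllib2` and BeautifulSoup 3, and the names `collect_pages`, `fetch`, and `max_pages` are made up for the example. It does no error handling, so a failed download stops the whole run.

```python
from urllib.parse import urljoin
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download a page and return its raw bytes (needs network access)."""
    with urlopen(url) as response:
        return response.read()


def collect_pages(start_urls, max_pages=10, fetch=fetch):
    """Visit pages in order, queueing the links found on each one.

    Returns a dict mapping each visited URL to its HTML. `max_pages` and the
    replaceable `fetch` argument are assumptions made for this sketch.
    """
    queue = list(start_urls)
    pages = {}
    for url in queue:                # the list grows while we iterate over it
        if len(pages) >= max_pages:
            break                    # stop when enough pages are found
        if url in pages:
            continue                 # skip URLs that were already visited
        html = fetch(url)
        pages[url] = html
        soup = BeautifulSoup(html, 'html.parser')
        for a in soup.find_all('a', href=True):
            # resolve relative links against the current page
            link = urljoin(url, a['href'])
            # ignore mailto:, javascript: and other non-web links
            if link.startswith(('http://', 'https://')):
                queue.append(link)
    return pages
```

Passing a different `fetch` function makes it possible to try the loop on canned HTML without touching the network.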

Jiri

Hi @Jiri -- do you mean "find the HTML"? – Georgina Mar 16 '11 at 20:37

OK, so you don't need to traverse the site by following the URLs in pages, just to strip the text. You can try ''.join(soup.findAll(text=True)) – Jiri Mar 16 '11 at 20:43
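
A small illustration of the one-liner from the comment above, written for Python 3 and the `bs4` package as an assumption (there `findAll(text=True)` is spelled `find_all(string=True)`). The sample HTML is made up for the example.

```python
from bs4 import BeautifulSoup

# made-up sample markup for the example
html = '<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# every text node in document order, joined without a separator
text = ''.join(soup.find_all(string=True))
print(text)  # TitleSome bold text.
```

Because nothing separates the text nodes, the heading and the paragraph run together. Joining with `'\n'` instead of `''` keeps them apart, at the cost of also breaking lines inside the paragraph.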

score 1 · Answer 3 · answered Mar 16 '11 at 21:09

I know that it is not an answer to your exact question (about BeautifulSoup), but it would be a good idea to have a look at Scrapy, which seems to fit your needs.

philnext
