
So, I wrote a minimal function to scrape all the text from a webpage:

import requests
from bs4 import BeautifulSoup

url = 'http://www.brainpickings.org'
response = requests.get(url)
soup = BeautifulSoup(response.content)
texts = soup.findAll(text=True)

def visible(element):
    # Text nodes inside these tags are never rendered on the page
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    return True

print filter(visible, texts)

But it doesn't work that smoothly: text from unwanted tags still comes through. Also, if I try to do a regex removal of various characters I don't want, I get an error:

    elif re.match('<!--.*-->', str(element)):
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 209: ordinal not in range(128)
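
For what it's worth, the UnicodeEncodeError comes from the str(element) call, not from the regex itself: in Python 2, str() on a unicode string implicitly encodes it as ASCII, which fails on non-ASCII characters like u'\u2019'. A minimal sketch that sidesteps the conversion by testing for bs4's Comment type instead of stringifying (the isinstance check is an assumed substitute for the regex, not what the code above does):

from bs4.element import Comment

def visible(element):
    # Text inside these tags is never rendered
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Drop HTML comments without forcing the unicode text through str()
    if isinstance(element, Comment):
        return False
    return True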

So, how can I improve this to extract only the visible text?

Hick
    English meaning nazi: "scrapping" is the process of turning a car into scrap metal; to discard or remove from service. You probably meant "scraping, to scrape", to remove an outer layer with a tool. Corrected your post for you :-) – Martijn Pieters Jun 05 '12 at 09:06
    Do not use Regex for HTML parsing, see [why](http://stackoverflow.com/a/1732454/851737). – schlamar Jun 05 '12 at 09:14
  • Use [splinter zope](http://splinter.cobrateam.info/docs/drivers/zope.testbrowser.html) .Easy to use. – Priyank Patel Jun 05 '12 at 09:19

1 Answer


With lxml this is pretty easy:

from lxml import html

# Parse the bytes fetched with requests above
doc = html.fromstring(response.content)
print doc.text_content()

Edit: Filtering the head could be done as follows:

print doc.body.text_content()
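
If script or style text still leaks through, a sketch of one way to strip those subtrees before extracting (this uses lxml's etree.strip_elements and assumes the response fetched in the question; it is an extra step, not part of the original answer):

from lxml import etree, html

doc = html.fromstring(response.content)
# Remove <script> and <style> elements along with their text;
# with_tail=False keeps the visible text that follows each removed element
etree.strip_elements(doc, 'script', 'style', with_tail=False)
print doc.body.text_content()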
schlamar