I'm trying to get all the text from an HTML tag using BeautifulSoup's get_text() method. I use Python 2.7 and Beautiful Soup 4.4.0. It works most of the time, but sometimes the method returns only the first paragraph of a tag, and I can't figure out why. Please see the following example.

from bs4 import BeautifulSoup
import urllib2

job_url = "http://www.indeed.com/viewjob?jk=0f5592c8191a21af"
site = urllib2.urlopen(job_url).read()
soup = BeautifulSoup(site, "html.parser")
text = soup.find("span", {"class": "summary"}).get_text()
print text

I want to get all the content of this Indeed job description. Basically, I want all the text in <span class="summary">. However, with the code above I only get "Please note that this is a 1 year contract assignment. Candidates cannot start an assignment until background check and drug test is completed". Why am I losing the rest of the text? How can I get all the text from this tag without specifying sub-tags?

Thanks a lot.

Shengjie Zhang

1 Answer

Try a different parser, such as lxml, instead of html.parser:

Replace:

soup = BeautifulSoup(site, "html.parser")

with:

soup = BeautifulSoup(site, "lxml")

Make sure you have the lxml parser installed first: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
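
To see whether the parser is what makes the difference, it can help to run the same markup through each installed parser and compare the extracted text. The following is only a minimal sketch: the helper name summary_text_by_parser and the sample markup are hypothetical, not taken from the Indeed page, and parsers that are not installed are reported as None rather than raising.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def summary_text_by_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to the text of <span class="summary">.

    Hypothetical helper for illustration. The value is None when the
    parser is not installed and "" when no matching tag is found.
    """
    results = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            results[parser] = None
            continue
        tag = soup.find("span", {"class": "summary"})
        # Join the text fragments with a space and strip surrounding whitespace.
        results[parser] = tag.get_text(" ", strip=True) if tag is not None else ""
    return results


# Made-up, well-formed sample; real pages may be broken in ways that
# make the parsers disagree.
sample = '<span class="summary">First paragraph.<br/>Second paragraph.</span>'
for name, text in summary_text_by_parser(sample).items():
    print("{0}: {1!r}".format(name, text))
```

On well-formed markup like this sample the installed parsers would be expected to agree; passing the downloaded page (site in the question) instead of sample shows which parsers keep the full description.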

Joe Young
  • Thanks Joe, it works! Why is the lxml parser better than html.parser for this task? What's the difference? – Shengjie Zhang Sep 19 '15 at 17:27
  • @alecxe Both parsers work for me as well. Is the html.parser I was using probably out of date? – Shengjie Zhang Sep 19 '15 at 17:29
  • @ShengjieZhang Nope, it's just that different parsers interpret that unreliable and broken soup of HTML tags differently, see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers – alecxe Sep 19 '15 at 17:31
  • @alecxe Thanks! Great resource. I'll go for html5lib. – Shengjie Zhang Sep 19 '15 at 17:38