
I'm trying to scrape a speech from a website using BeautifulSoup. I'm encountering problems, however, since the speech is divided into many different paragraphs. I'm extremely new to programming and am having trouble figuring out how to deal with this. The HTML of the page looks like this:

<span class="displaytext">Thank you very much. Mr. Speaker, Vice President Cheney, 
Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is    
at war; our economy is in recession; and the civilized world faces unprecedented dangers. 
Yet, the state of our Union has never been stronger.
<p>We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims, 
begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and  
rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps, 
saved a people from starvation, and freed a country from brutal oppression. 
<p>The American flag flies again over our Embassy in Kabul. Terrorists who once occupied 
Afghanistan now occupy cells at Guantanamo Bay. And terrorist leaders who urged followers to 
sacrifice their lives are running for their own.

It continues like that for a while, with multiple paragraph tags. I'm trying to extract all of the text within the span.

I've tried two different ways to get the text, but neither gives me the full text that I want.

The first I tried is:

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
print thespan.string

which gives me:

Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.

That is the portion of the text up until the first paragraph tag. I then tried:

import urllib2,sys
from BeautifulSoup import BeautifulSoup, NavigableString

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
html = urllib2.urlopen(address).read()

soup = BeautifulSoup(html)
thespan = soup.find('span', attrs={'class': 'displaytext'})
for section in thespan:
     paragraph = section.findNext('p')
     if paragraph and paragraph.string:
         print '>', paragraph.string
     else:
         print '>', section.parent.next.next.strip()

This gave me the text between the first paragraph tag and the second paragraph tag. So, I'm looking for a way to get the entire text, instead of just sections.

user1074057

3 Answers

import urllib2,sys
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())

span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
paras = [x.contents[0] for x in span.findAllNext("p")]  # this gives you the rest
# use .contents[0] instead of .string to deal with last para that's not well formed

print "%s\n\n%s" % (span.string, "\n\n".join(paras))

As pointed out in the comments, the above does not work so well if the <p> tags contain more nested tags. This can be dealt with using:

paras = ["".join(x.findAll(text=True)) for x in span.findAllNext("p")]

However, that doesn't work too well with the last <p> that does not have a closing tag. A hacky workaround would be to treat that differently. For example:

import urllib2,sys
from BeautifulSoup import BeautifulSoup

address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW'
soup = BeautifulSoup(urllib2.urlopen(address).read())
span = soup.find("span", {"class":"displaytext"})  
paras = span.findAllNext("p")

start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]
print "%s\n\n%s\n\n%s" % (start, middle, last)
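
As a rough alternative to special-casing the last paragraph, the sketch below uses only the Python 3 standard library (`html.parser`, not BeautifulSoup) and treats every `<p>` start tag inside the span as a paragraph boundary, so unclosed `<p>` tags and nested tags need no separate handling. The class name `SpanTextExtractor` and the sample markup are made up for illustration, and it has not been run against the real page.

```python
from html.parser import HTMLParser


class SpanTextExtractor(HTMLParser):
    """Collect text inside <span class="displaytext">, split into paragraphs."""

    def __init__(self, target_class="displaytext"):
        super().__init__(convert_charrefs=True)
        self.target_class = target_class
        self.depth = 0          # 0 = outside the target span
        self.paragraphs = [[]]  # each paragraph is a list of text chunks

    def handle_starttag(self, tag, attrs):
        if self.depth == 0:
            # Enter the target span when its class attribute matches.
            classes = (dict(attrs).get("class") or "").split()
            if tag == "span" and self.target_class in classes:
                self.depth = 1
        elif tag == "span":
            self.depth += 1     # nested span: count it to find the right close
        elif tag == "p":
            self.paragraphs.append([])  # every <p> starts a new paragraph

    def handle_endtag(self, tag):
        if self.depth and tag == "span":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.paragraphs[-1].append(data)

    def text(self):
        # Collapse runs of whitespace and drop empty paragraphs.
        cleaned = (" ".join("".join(chunks).split()) for chunks in self.paragraphs)
        return "\n\n".join(p for p in cleaned if p)


# Hypothetical markup modeled on the snippet in the question.
sample = """<html><body><h1>Title</h1>
<span class="displaytext">First part,
still first.
<p>Second <i>nested</i> part.
<p>Third part, never closed.
</span><p>Footer outside the span.</body></html>"""

parser = SpanTextExtractor()
parser.feed(sample)
parser.close()
result = parser.text()
print(result)
```

Text outside the span (the heading and the footer paragraph) is ignored, and the nested `<i>` contributes its text to the paragraph it sits in.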
Shawn Chin
  • This does not work with the web-page linked to in the question (i.e. it will only print the first paragraph - not the whole speech). – ekhumoro Nov 30 '11 at 19:56
  • I ought to thank you as well. It was an interesting exercise. ;) – Shawn Chin Nov 30 '11 at 20:48
  • @user1074057. Not _quite_ perfect. There's a bit missing at the end of the paragraph starting "Good jobs begin with good schools...". – ekhumoro Nov 30 '11 at 21:08
  • @ekhumoro You're hard to please! That line has more nested tags which can be dealt with if we use `paras = ["".join(x.findAll(text=True)) for x in span.findAllNext("p")]`, but this breaks the last para which is not well formed. Will look for a more robust approach when I get a chance to. – Shawn Chin Nov 30 '11 at 21:21
  • @ShawnChin. Well, I think your current solution is "good enough" (if not 100% perfect) - so I gave you an upvote anyway ;-) – ekhumoro Nov 30 '11 at 21:26
  • of all the "BS" solutions for getting span/class/blah, this one works for me. thanks – philshem Oct 11 '13 at 09:30

Here's how it would be done with lxml:

import lxml.html as lh

tree = lh.parse('http://www.presidency.ucsb.edu/ws/index.php?pid=29644&st=&st1=#axzz1fD98kGZW')

text = tree.xpath("//span[@class='displaytext']")[0].text_content()

Alternatively, the answers to this question cover how to achieve the same thing using BeautifulSoup: BeautifulSoup - easy way to to obtain HTML-free contents

The helper function from the accepted answer:

def textOf(soup):
    return u''.join(soup.findAll(text=True))
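
The idea behind `textOf` (concatenate every text node under an element, in document order) can be illustrated with the standard library alone. This is only a sketch: `xml.etree.ElementTree` accepts well-formed XML only, so the snippet below is a hypothetical, cleaned-up fragment with every `<p>` closed, and this approach would not parse the real page with its unclosed tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet (every <p> is closed, unlike the real page).
snippet = (
    '<span class="displaytext">Intro text. '
    '<p>Good jobs begin with <b>good schools</b>, and more.</p>'
    '<p>Last paragraph.</p></span>'
)

span = ET.fromstring(snippet)

# .text holds only the text that precedes the first child element.
first_only = span.text

# itertext() yields every text node in document order, including text inside
# nested tags and the "tail" text that follows them.
full_text = "".join(span.itertext())

print(first_only)
print(full_text)
```

Joining with an empty string keeps the text exactly as it appears in the markup, so paragraphs run together unless the separator is chosen differently.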
Acorn
  • Perhaps let the OP know why lxml is a good alternative to BeautifulSoup :) – Derek Litz Nov 30 '11 at 19:59
  • Neither of these suggestions will produce the output asked for in the question. – ekhumoro Nov 30 '11 at 20:00
  • @ekhumoro, could you please explain in what way my solution fails to produce the desired output? The OP wants to `"...extract all of the text within the span"`, and that is what the above code does. – Acorn Nov 30 '11 at 20:44
  • Try it using the web-page linked to in the question (hint: you need to get just the speech - and the whole of it, not just the first paragraph). – ekhumoro Nov 30 '11 at 20:58
  • My initial code was addressing the problem of extracting all the text, not of selecting the span element itself from within the document. It was meant to be fed the snippet that the OP posted at the beginning of his question. I've changed my answer to show how one would extract the correct span element from the document. – Acorn Nov 30 '11 at 21:20

You should try:

soup.span.renderContents()
Ofir Farchy
  • `.renderContents()` does not do what the OP wants. It doesn't remove the paragraph tags. – Acorn Nov 30 '11 at 19:45