Python: obtaining passages from html

Question

I am iterating through a list of links to collect all of Obama's speeches. For some links, however, the HTML looks like this:

<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">If 
              there is anyone out there who still doubts that America is a place 
              where all things are possible; who still wonders if the dream of 
              our founders is alive in our time; who still questions the power 
              of our democracy, tonight is your answer.</font></p>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">It’s 
              the answer told by lines that stretched around schools and churches 
              in numbers this nation has never seen; by people who waited three 
              hours and four hours, many for the very first time in their lives, 
              because they believed that this time must be different; that their 
              voice could be that difference.</font></p>
<p><font face="Verdana, Arial, Helvetica, sans-serif" size="3">It’s 
              the answer spoken by young and old, rich and poor, Democrat and 
              Republican, black, white, Latino, Asian, Native American, gay, straight, 
              disabled and not disabled – Americans who sent a message to 
              the world that we have never been a collection of Red States and 
              Blue States: we are, and always will be, the United States of America.</font></p>

If I call soup.find_all('font') on this page, I only get individual paragraphs, not the whole passage. For other links, however, the HTML may look like the text below, and there soup.find_all('font') returns the whole passage.

</font></strong><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><br/>
</font></font><font face="Verdana, Arial, Helvetica, sans-serif" size="3"><br/>
            My fellow citizens:<br/>
<br/>
            I stand here today humbled by the task before us, grateful for the 
            trust you have bestowed, mindful of the sacrifices borne by our ancestors. 
            I thank President Bush for his service to our nation, as well as the 
            generosity and cooperation he has shown throughout this transition.<br/>
<br/>
            Forty-four Americans have now taken the presidential oath. The words 
            have been spoken during rising tides of prosperity and the still waters 
            of peace. Yet, every so often the oath is taken amidst gathering clouds 
            and raging storms. At these moments, America has carried on not simply 
            because of the skill or vision of those in high office, but because 
            We the People have remained faithful to the ideals of our forbearers, 
            and true to our founding documents.<br/>
<br/>
            So it has been. So it must be with this generation of Americans.<br/>
</font> <div align="left">

Goal: I want to obtain the entire speech, not just individual paragraphs. How can I achieve this using BeautifulSoup in Python?

These two speeches come from:

http://obamaspeeches.com/E11-Barack-Obama-Election-Night-Victory-Speech-Grant-Park-Illinois-November-4-2008.htm

http://obamaspeeches.com/P-Obama-Inaugural-Speech-Inauguration.htm

score 1 · Accepted Answer · answered Jun 30 '14 at 12:10


Unfortunately, since the pages don't follow a standard layout, this means a little more work for you: a single logic flow won't handle them all.

However, for the specific case you list, you could do either of the following:

Select a parent element that contains the font tags, e.g. a table. (Note: You'll need some logic to verify which table contains what you want, since that website uses table layouts.)

for table in soup.find_all('table'):
    # this_is_the_table_you_want is a placeholder: substitute your own check
    if this_is_the_table_you_want:
        print(table.text)

-or-

Simply build the string from the tags you already have

speech_text = ""
for font in soup.find_all('font'):
    speech_text += font.text
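
Both options above can be sketched a little more concretely. This is only an illustration under assumptions, not a tested solution for the real pages: the function names are made up, `<font>` tags nested inside another `<font>` are skipped so their text is not counted twice, and "the table you want" is guessed to be the innermost table holding the most text.

```python
from bs4 import BeautifulSoup


def normalize(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())


def speech_from_fonts(soup):
    """Join the text of every outermost <font> tag, one paragraph per tag."""
    parts = []
    for font in soup.find_all("font"):
        # Skip <font> tags nested inside another <font>: the outer tag's
        # text already includes them, so adding both would duplicate text.
        if font.find_parent("font") is not None:
            continue
        text = normalize(font.get_text(" "))
        if text:
            parts.append(text)
    return "\n\n".join(parts)


def speech_from_largest_table(soup):
    """Return the text of the innermost <table> holding the most text."""
    # Assumption: the speech sits in a table with no nested tables.
    # Outer layout tables contain everything, so they are excluded here.
    leaves = [t for t in soup.find_all("table") if t.find("table") is None]
    if not leaves:
        return ""
    best = max(leaves, key=lambda t: len(t.get_text()))
    return normalize(best.get_text(" "))
```

On pages where the whole speech sits in a single `<font>` tag separated by `<br/>`, `speech_from_fonts` returns it as one block, because `normalize` collapses the line breaks. Either heuristic can pick up navigation or footer text, so it is worth checking the output for each site.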

nerdwaller

What about this page? http://mittromneycentral.com/speeches/2012-speeches/031912-remarks-in-chicago-the-freedom-to-dream/ `table` and `font` don't work at all. What is the best way to extract the speech from it? Thanks – mynameisJEFF Jun 30 '14 at 13:58
It's bad form on Stack Overflow to keep building your question. If someone answers your original question, you should accept it and hopefully learn from their suggestions. If you try for a while after that and cannot adapt the lesson on your own, you should open a new question and post what you tried. – nerdwaller Jun 30 '14 at 14:01
I am sorry, but this is a very similar question. That's why I wanted to ask if you have any ideas for handling a situation like this. – mynameisJEFF Jun 30 '14 at 14:06
@Chinegro - This one is easier since you have an `id`, which should be unique within the page (technically speaking, but it may not be). So you can add qualifiers to your `.find()`, such as: `soup.find('div', {'id': 'post-54617'})`. See [here](http://stackoverflow.com/questions/2136267/beautiful-soup-and-extracting-a-div-and-its-contents-by-id) for an example. – nerdwaller Jun 30 '14 at 14:09
How did you obtain the ID in Python? Because I have a list of websites to iterate through, I am guessing there is a different ID for different links. – mynameisJEFF Jun 30 '14 at 14:12
I did it manually for that. If you want it fully programmatic, you will need some logic that checks the properties of the components. If that's too much work, then one option is to group the links by site (as each site probably has a consistent layout). – nerdwaller Jun 30 '14 at 14:20
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/56558/discussion-between-chinegro-and-nerdwaller). – mynameisJEFF Jun 30 '14 at 16:42
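
The "group them by site" idea from the comments could look roughly like the sketch below. Everything in it is an assumption for illustration: the function names and the `EXTRACTORS` mapping are made up, and the pattern `post-<digits>` is only a guess generalized from the single `post-54617` id mentioned above, so other pages on that site may need a different check.

```python
import re
from urllib.parse import urlparse

from bs4 import BeautifulSoup


def extract_fonts(soup):
    """Join the text of each outermost <font> tag (table-layout pages)."""
    fonts = [f for f in soup.find_all("font") if f.find_parent("font") is None]
    return "\n\n".join(f.get_text(" ", strip=True) for f in fonts)


def extract_by_post_id(soup):
    """Return the text of the first <div> whose id looks like 'post-54617'."""
    # Assumption: the post body lives in a div with id "post-<digits>".
    div = soup.find("div", id=re.compile(r"^post-\d+$"))
    return div.get_text(" ", strip=True) if div else ""


# Hypothetical mapping: one extractor per site, keyed by host name.
EXTRACTORS = {
    "obamaspeeches.com": extract_fonts,
    "mittromneycentral.com": extract_by_post_id,
}


def extract_speech(url, html):
    """Pick the extractor registered for the URL's host and run it."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    extractor = EXTRACTORS.get(host)
    if extractor is None:
        raise ValueError(f"no extractor registered for {host!r}")
    return extractor(BeautifulSoup(html, "html.parser"))
```

Raising an error for unknown hosts, rather than silently falling back to a generic rule, makes it obvious which sites in the list still need their own extractor.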
