How to use Beautiful Soup to get plaintext and URLs from an HTML document?

Question

I was using Python and regular expressions to find things in an HTML document and, unlike what most people say, it was working perfectly, even though things could go wrong. Anyway, I decided Beautiful Soup would be faster and easier, but I don't really know how to make it do what I did with regex, which was fairly easy but messy.

I am using this page's HTML:

http://www.locationary.com/places/duplicates.jsp?inPID=1000000001

EDIT:

Here is the HTML for the main place:

<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes
</td>
</tr>

Example of the first similar place:

<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55">

When my program gets it and I use Beautiful Soup to make it more readable, the HTML comes out a little different than Firefox's "view source"...I don't know why.

These were my regular expressions:

PlaceName = re.findall(r'"nowrap">(.*)&nbsp;</td>', main)

PlaceAddress = re.findall(r'width="100%">(.*)</td>\n<td class="Large Bold"', main)

cNames = re.findall(r'target="_blank">(.*)</a></td>\n<td width="100%">&nbsp;', main)

cAddresses = re.findall(r'<td width="100%">&nbsp;(.*)</td>\n<td nowrap="nowrap" width="55">', main)

cURLs = re.findall(r'<td class="" nowrap="nowrap"><a href="(.*)" target="_blank">', main)

The first two are for the main place and address. The rest are for the information of the rest of the places. After I made these, I decided I only wanted the first 5 results for cNames, cAddresses, and cURLs, because I don't need 91 or whatever it was.

I don't know how to find this kind of information with BS. All I can do with BS is find specific tags and do things with them. This HTML is kind of complicated because all of the info I want is in tables, and the table tags are kind of a mess too...

How do you get that info, and limit it only to the first 5 results or so?

Thanks.

Please include the relevant part of the HTML here for your question to be useful for future readers. — , Aug 10 '12 at 13:44
There is no royal road to HTML parsing. That means that you have to spend some time learning some parser and BeautifulSoup is one of the easier ones. You really can't cheat the task with regular expressions. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 Really. — msw, Aug 10 '12 at 14:29

msw · Accepted Answer · 2012-08-10T17:24:57.717

People say that you can't parse HTML with regular expressions for a reason, but here's a simple reason that applies to your regexps: you've got \n and &nbsp; in them, and those can and will change at random on the page(s) you are trying to parse. When that happens, your regexps won't match and your code will stop working.
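To make that concrete, here is a minimal sketch using the cNames pattern from the question. The href value `x.jsp` and the shortened address are placeholders I made up to keep the sample short, and the re-indented variant is only my guess at the kind of harmless whitespace change a site might make.

```python
import re

# The cNames pattern from the question, unchanged.
pattern = r'target="_blank">(.*)</a></td>\n<td width="100%">&nbsp;'

# Markup shaped like the question's sample (href and address are placeholders).
original = (
    '<td class="" nowrap="nowrap"><a href="x.jsp" target="_blank">'
    '54 Riverside Dr Owners Corp</a></td>\n'
    '<td width="100%">&nbsp;54 Riverside Dr</td>'
)

# Same markup, but the second cell is now indented by two spaces.
reindented = original.replace('</td>\n<td', '</td>\n  <td')

matches_original = re.findall(pattern, original)
matches_reindented = re.findall(pattern, reindented)

print(matches_original)    # ['54 Riverside Dr Owners Corp']
print(matches_reindented)  # []
```

A browser renders both versions identically, but the regexp only matches the first one.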

However, the task that you are looking to do is really simple:

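# BeautifulSoup 3 import, Python 2 syntax (print statement).
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('this-stackoverflow-page.html'))

# Calling the soup like a function is shorthand for findAll('a').
for anchor in soup('a'):
    print anchor.contents, anchor.get('href')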

yields all the anchor tags no matter where they appear in the deeply nested structure of this page. Here are lines I excerpted from the output of that three-line script:

[u'Stack Exchange'] http://stackexchange.com
[u'msw'] /users/282912/msw
[u'faq'] /faq
[u'Stack Overflow'] /
[u'Questions'] /questions
[u'How to use Beautiful Soup to get plaintext and URLs from an HTML document?'] /questions/11902974/how-to-use-beautiful-soup-to-get-plaintext-and-urls-from-an-html-document
[u'http://www.locationary.com/places/duplicates.jsp?inPID=1000000001'] http://www.locationary.com/places/duplicates.jsp?inPID=1000000001
[u'python'] /questions/tagged/python
[u'beautifulsoup'] /questions/tagged/beautifulsoup
[u'Marcus Johnson'] /users/1587751/marcus-johnson

It is hard to imagine less code that could do that much work for you.
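Applied to the HTML in the question, the same idea might look like the sketch below. It assumes the newer `bs4` package (Beautiful Soup 4) on Python 3 rather than the BeautifulSoup 3 import used above. The sample markup is copied from the question, but the enclosing `<table>` and `<tr>` tags and the closing of the last cell are my assumptions, since the question's snippet is cut off. `cell_text` is a hypothetical helper, not part of the library.

```python
from bs4 import BeautifulSoup

# Sample rows from the question; the table/tr wrappers and the closing
# of the final cell are assumed, because the original snippet is truncated.
html = """
<table>
<tr>
<td class="Large Bold" nowrap="nowrap">Riverside Tower Hotel&nbsp;</td>
<td class="Large Bold" width="100%">80 Riverside Drive, New York, New York, United States</td>
<td class="Large Bold" nowrap="nowrap" width="55">&nbsp;<input name="selectCheckBox" type="checkbox" checked="checked" disabled="disabled" />Yes
</td>
</tr>
<tr>
<td class="" nowrap="nowrap"><a href="http://www.locationary.com/place/en/US/New_York/New_York/54_Riverside_Dr_Owners_Corp-p1009633680.jsp" target="_blank">54 Riverside Dr Owners Corp</a></td>
<td width="100%">&nbsp;54 Riverside Dr, New York, New York, United States</td>
<td nowrap="nowrap" width="55"></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")


def cell_text(node):
    """Hypothetical helper: text of a tag with &nbsp; and padding removed."""
    return node.get_text().replace("\xa0", " ").strip()


# Main place: the first two cells carrying the "Large" class.
main_cells = soup.find_all("td", class_="Large", limit=2)
place_name = cell_text(main_cells[0])
place_address = cell_text(main_cells[1])

# Similar places: each link opens in a new tab, and its address sits in the
# next cell of the same row. limit=5 stops after the first five links.
similar = []
for anchor in soup.find_all("a", attrs={"target": "_blank"}, limit=5):
    address_cell = anchor.find_parent("td").find_next_sibling("td")
    similar.append((cell_text(anchor), cell_text(address_cell), anchor.get("href")))

print(place_name)
print(place_address)
for name, address, url in similar:
    print(name, address, url)
```

Because the lookups go by tag name and attribute instead of exact character sequences, extra newlines or indentation between the cells do not matter here. On the real page the selectors may need adjusting if other links also use target="_blank".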
