Python to parse web page for 'title'

Question

I'd like to be able to parse a web page and return any element that has a title containing exactly 4 letters.

For example:

<li><a href="test.com/dogs" title="dogs"></a></li>
<li><a href="test.com/cat" title="cat"></a></li>
<li><a href="test.com/horse" title="horse"></a></li>
<li><a href="test.com/eels" title="eels"></a></li>

In this example, I'd like to return an array containing 'dogs' and 'eels' since the title contains exactly 4 characters. How can I go about doing this? Thanks!

XML parsers exist. Since you're asking about Python, do a Google search for "beautifulsoup". — , Dec 17 '12 at 17:59
How often must we explain per day that markup should be parsed with HTML or XML parsers and not with anything else? A trillion times? -1 from me — , Dec 17 '12 at 18:00
@user1833746 In the OP's defense, he did ask 'how can I go about doing this', an answer to which would be something like Jack Maney suggested. — RocketDonkey, Dec 17 '12 at 18:02
http://stackoverflow.com/questions/13903868/python-url-extract-from-html/13903924#13903924 — Abhijit, Dec 17 '12 at 18:03
You can't parse HTML with regular expressions reliably. http://htmlparsing.com/python.html has examples of how to use a parser. — Andy Lester, Dec 17 '12 at 18:24

jackcogdill · Accepted Answer · 2012-12-17T18:46:46.180

6

You should use BeautifulSoup.

Using that, you can do something like this:

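import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2

url = ''  # put url here
page = urllib2.urlopen(url)
text = page.read()
page.close()
soup = BeautifulSoup(text)

# Collect the title of the first <a> in each <li> when it is exactly 4 characters long.
L = []
for x in soup.findAll('li'):
    link = x.a
    # Skip <li> elements that have no link or whose link has no title attribute.
    if link is not None and link.has_key('title'):
        if len(link['title']) == 4:
            L.append(link['title'])
print L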
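
If BeautifulSoup is not available, roughly the same filtering can be sketched with the standard library's `html.parser` on Python 3. This is only an illustration, not part of the code above: the `TitleCollector` class and the `html` sample string are names made up for this example, and it looks at every `<a>` tag rather than only the first link inside each `<li>`.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Hypothetical helper: collect <a> title values of a given length."""

    def __init__(self, length=4):
        super().__init__()
        self.length = length
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "title" and value is not None and len(value) == self.length:
                self.titles.append(value)


# Sample markup taken from the question.
html = """
<li><a href="test.com/dogs" title="dogs"></a></li>
<li><a href="test.com/cat" title="cat"></a></li>
<li><a href="test.com/horse" title="horse"></a></li>
<li><a href="test.com/eels" title="eels"></a></li>
"""

collector = TitleCollector()
collector.feed(html)
print(collector.titles)  # ['dogs', 'eels']
```

To run it against a real page, feed the decoded page text to `collector.feed()` instead of the sample string.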

edited Dec 17 '12 at 18:46

answered Dec 17 '12 at 18:02


Dude.. This will work fine. `Beautifulsoup` is just a `.py` file you can import like this: `from BeautifulSoup import BeautifulSoup` – jackcogdill Dec 17 '12 at 18:03
For some reason it's not coming back with anything. I've tried doing the 'soup.findAll' on 'li' and 'a', neither will return anything even when I try to print x in the for loop – ad2387 Dec 17 '12 at 18:38
Still no luck :/ Comes back with [] – ad2387 Dec 17 '12 at 18:48
Are you sure? I tested it by pasting your HTML code directly into the `soup` string and the output was: `[u'dogs', u'eels']` – jackcogdill Dec 17 '12 at 18:50
What's the url that you're using? – jackcogdill Dec 17 '12 at 18:53
It's a private link. I think I'm running into login issues due to running it locally. I agree that putting it into the string seems to work. Thanks for your help..I should be able to take it from here. – ad2387 Dec 17 '12 at 19:11

score 0 · Answer 2 · answered Dec 17 '12 at 18:38

I know that parsing HTML with re is considered bad practice, but I do like a straightforward approach.

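 #!/usr/bin/env python
 import re

 # Match title="xxxx" where xxxx is exactly four ASCII letters.
 pattern = re.compile(r'title="([a-zA-Z]{4})"')

 res_array = []
 with open('inputdata', 'r') as f:
     for line in f:
         # findall returns the captured title for every match on the line,
         # so several links on one line are all picked up.
         res_array.extend(pattern.findall(line))
 print(res_array)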
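
As the comments on the question point out, a regex like this only handles markup written in one exact way. The short sketch below shows two equally valid spellings of the same attribute that a four-letter `title="..."` pattern misses; the `samples` strings are made up for illustration.

```python
import re

# Four ASCII letters inside title="...", double quotes only, no spaces around '='.
pattern = re.compile(r'title="([a-zA-Z]{4})"')

samples = [
    '<a href="test.com/dogs" title="dogs"></a>',    # double quotes: matched
    "<a href='test.com/eels' title='eels'></a>",    # single quotes: missed
    '<a href="test.com/dogs" title = "dogs"></a>',  # spaces around '=': missed
]

matches = [pattern.findall(s) for s in samples]
print(matches)  # [['dogs'], [], []]
```

An HTML parser normalizes these variations, which is why it is usually the safer choice for pages you do not control.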
