2

I'd like to be able to parse a web page and return any element that has a title containing exactly 4 letters.

For example:

<li><a href="test.com/dogs" title="dogs"></a></li>
<li><a href="test.com/cat" title="cat"></a></li>
<li><a href="test.com/horse" title="horse"></a></li>
<li><a href="test.com/eels" title="eels"></a></li>

In this example, I'd like to return an array containing 'dogs' and 'eels', since those titles contain exactly 4 characters. How can I go about doing this? Thanks!

ad2387
  • XML parsers exist. Since you're asking about Python, do a Google search for "beautifulsoup". –  Dec 17 '12 at 17:59
  • How often must we explain per day that markup should be parsed with HTML or XML parsers and not with anything else? A trillion times? -1 from me –  Dec 17 '12 at 18:00
  • @user1833746 In the OP's defense, he did ask 'how can I go about doing this', an answer to which would be something like Jack Maney suggested. – RocketDonkey Dec 17 '12 at 18:02
  • http://stackoverflow.com/questions/13903868/python-url-extract-from-html/13903924#13903924 – Abhijit Dec 17 '12 at 18:03
  • You can't parse HTML with regular expressions reliably. http://htmlparsing.com/python.html has examples of how to use a parser. – Andy Lester Dec 17 '12 at 18:24

2 Answers

6

You should use BeautifulSoup.

Using that, you can do something like this:

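import urllib2
from BeautifulSoup import BeautifulSoup

url = '...'  # put url here
page = urllib2.urlopen(url)
text = page.read()
page.close()
soup = BeautifulSoup(text)

# Collect the title of every <li><a> whose title is exactly 4 characters long
L = []
for x in soup.findAll('li'):
    link = x.a
    # Skip <li> elements that have no link or whose link has no title
    if link is not None and link.has_key('title'):
        if len(link['title']) == 4:
            L.append(link['title'])
print L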
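
If installing a third-party parser is not an option, roughly the same filtering can be sketched with only the standard library. This is a minimal sketch that assumes Python 3 and uses `html.parser` instead of BeautifulSoup; the `TitleCollector` class name and the inline `html` sample (copied from the question) are illustrative, not part of any library. Unlike the loop over `li` elements, it looks at the `title` attribute of every start tag.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect title attribute values of a given length (hypothetical helper)."""

    def __init__(self, length=4):
        super().__init__()
        self.length = length
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        for name, value in attrs:
            if name == 'title' and value is not None and len(value) == self.length:
                self.titles.append(value)


# Sample markup taken from the question
html = '''
<li><a href="test.com/dogs" title="dogs"></a></li>
<li><a href="test.com/cat" title="cat"></a></li>
<li><a href="test.com/horse" title="horse"></a></li>
<li><a href="test.com/eels" title="eels"></a></li>
'''

collector = TitleCollector()
collector.feed(html)
print(collector.titles)  # ['dogs', 'eels']
```

Feeding the page text as a string, as done here, is also a quick way to check the filtering logic separately from any download or login problems.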
jackcogdill
  • Dude.. This will work fine. `Beautifulsoup` is just a `.py` file you can import like this: `from BeautifulSoup import BeautifulSoup` – jackcogdill Dec 17 '12 at 18:03
  • For some reason it's not coming back with anything. I've tried doing the 'soup.findAll' on 'li' and 'a', neither will return anything even when I try to print x in the for loop – ad2387 Dec 17 '12 at 18:38
  • Still no luck :/ Comes back with [] – ad2387 Dec 17 '12 at 18:48
  • Are you sure? i tested it by pasting your html code directly into the `soup` string and the output was: `[u'dogs', u'eels']` – jackcogdill Dec 17 '12 at 18:50
  • What's the url that you're using? – jackcogdill Dec 17 '12 at 18:53
  • It's a private link. I think I'm running into login issues due to running it locally. I agree that putting it into the string seems to work. Thanks for your help..I should be able to take it from here. – ad2387 Dec 17 '12 at 19:11
0

I know that parsing HTML with re is considered bad practice, but I do like a straightforward approach.

#!/usr/bin/env python
import re

res_array = []
# Scan the saved page line by line for title attributes of exactly 4 letters
with open('inputdata', 'r') as f:
    for line in f:
        # findall returns only the captured group, i.e. the title text itself
        res_array.extend(re.findall(r'title="([a-zA-Z]{4})"', line))
print res_array
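
To make the caveat about regular expressions concrete, here is a small Python 3 sketch of two cases where a pattern of this kind gives the wrong answer. The two sample strings are made up for illustration, and the pattern is only one plausible way to write the match.

```python
import re

# Matches title="xxxx" with exactly four ASCII letters, double quotes only
TITLE_RE = re.compile(r'title="([a-zA-Z]{4})"')

# Valid HTML the pattern misses: a single-quoted attribute value
single_quoted = "<li><a href='test.com/eels' title='eels'></a></li>"

# Text the pattern wrongly accepts: markup that is commented out
commented_out = '<!-- <li><a href="test.com/dogs" title="dogs"></a></li> -->'

print(TITLE_RE.findall(single_quoted))  # []
print(TITLE_RE.findall(commented_out))  # ['dogs']
```

For a page whose markup you control and know to be uniform, the regex can be good enough; for arbitrary pages, an HTML parser avoids both problems.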
Danylo Gurianov