Parsing a website with Python

Question

I managed to get the page source as a string, but now I need to parse it, e.g. find each instance of a word and save the next few lines in an array.

The text I have looks something like this:

<div class="searchResult">
        <table id="ctl00_lp_ctl01_lst" class="searchResultList" cellspacing="0" border="0" style="border-collapse:collapse;">
        <tr>
            <td class="searchResultI">
                <div class="date">
                    13:07
                    &nbsp;&nbsp;
                    17 July
                    </div>
                <div class="sTitle">
                    <a href="www.example1.com/result1">
                        Link Description</a></div>
                <div class="sSubTitle">
                    </div>
            </td>
        </tr><tr>
            <td class="searchResultAI">
                <div class="date">
                    20:07
                    &nbsp;&nbsp;
                    16 July
                    </div>
                <div class="sTitle">
                    <a href="www.example2.com/result2">
                        Link Description</a></div>
                <div class="sSubTitle">
                    </div>
            </td>
        </tr><tr>

        and so on

I would like to get the href link and the link description of each result and put them in an array. I don't know why I'm stuck on something this trivial, since I have done several parsing projects in other languages. I already searched the web but found nothing helpful.

[Don't use regex](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#1732454) — RevanProdigalKnight, Jul 20 '14 at 14:44

score 8 · Accepted Answer · answered Jul 20 '14 at 14:54

You should not use regex to parse HTML. Python has plenty of libraries for HTML parsing. A good choice here is Beautiful Soup, a third-party package (installed as beautifulsoup4). This is how easy it is to get the href links with it:

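from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
from bs4 import BeautifulSoup

# Download the page source
html = urlopen("http://www.example.com/").read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
# Print the href of every <a> element
for link in soup.find_all('a'):
    print(link.get('href'))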
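
For the case in the question, where the page source is already a string and both the href and the link description are wanted, here is a minimal sketch. It swaps in the standard library's html.parser instead of Beautiful Soup so that it runs without installing anything. The names HTML, LinkCollector and extract_links are made up for this illustration, and the markup is a shortened copy of the sample from the question. With Beautiful Soup the equivalent would be a loop over soup.find_all('a') reading a.get('href') and a.get_text(strip=True).

```python
from html.parser import HTMLParser

# Shortened copy of the markup from the question (illustrative sample data).
HTML = """
<div class="searchResult">
  <table class="searchResultList">
    <tr><td class="searchResultI">
      <div class="sTitle">
        <a href="www.example1.com/result1">
            Link Description</a></div>
    </td></tr>
    <tr><td class="searchResultAI">
      <div class="sTitle">
        <a href="www.example2.com/result2">
            Link Description</a></div>
    </td></tr>
  </table>
</div>
"""


class LinkCollector(HTMLParser):
    """Collect (href, description) pairs for every <a> that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) tuples
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a href=...> element
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            description = "".join(self._text).strip()
            self.links.append((self._href, description))
            self._href = None


def extract_links(html):
    """Return a list of (href, description) tuples found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links(HTML)
for href, text in links:
    print(href, "->", text)
```

The result is a plain list of tuples, so links[0][0] is the first href and links[0][1] is its description. Anchors without an href attribute are skipped in this sketch.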

sgp

You should also provide an example that starts with an html string. – John Dvorak Jul 20 '14 at 15:02
thx for the lib and thank you for the example too :D – hahaha Jul 20 '14 at 15:06
Parsing HTML with regexes summons CTHULHU! – Thomas Junk Jul 20 '14 at 16:33