I managed to get the page source as a string, but now I need to parse it, e.g. find each instance of a word and save the next few lines in an array.

The text I have looks something like this:

<div class="searchResult">
        <table id="ctl00_lp_ctl01_lst" class="searchResultList" cellspacing="0" border="0" style="border-collapse:collapse;">
        <tr>
            <td class="searchResultI">
                <div class="date">
                    13:07
                    &nbsp;&nbsp;
                    17 July
                    </div>
                <div class="sTitle">
                    <a href="www.example1.com/result1">
                        Link Description</a></div>
                <div class="sSubTitle">
                    </div>
            </td>
        </tr><tr>
            <td class="searchResultAI">
                <div class="date">
                    20:07
                    &nbsp;&nbsp;
                    16 July
                    </div>
                <div class="sTitle">
                    <a href="www.example2.com/result2">
                        Link Description</a></div>
                <div class="sSubTitle">
                    </div>
            </td>
        </tr><tr>

        and so on

I would like to get each href link and its link description and put them in an array. I don't know why this is proving so hard for me, as I have done several parsing projects in other languages. I have already searched the web but found nothing helpful.

halfer
hahaha

1 Answer

You should not use regular expressions to parse HTML. Python has several HTML parsers to choose from; a good choice here is Beautiful Soup, a third-party package imported as bs4. This is how easy it is to get the href links with it:

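# Python 3: urllib2 was replaced by urllib.request
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.example.com/").read()
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))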
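
To collect both the href and the link description for the markup in the question, a minimal sketch using only the standard library's html.parser might look like the following. The class name TitleLinkParser and the variable page_source are made up for illustration, and the sketch assumes every link of interest is an <a> inside a <div class="sTitle">, as in the sample.

```python
from html.parser import HTMLParser


class TitleLinkParser(HTMLParser):
    """Collect (href, description) pairs from <a> tags inside <div class="sTitle">."""

    def __init__(self):
        super().__init__()
        self.links = []          # finished (href, description) tuples
        self._in_title = False   # True while inside <div class="sTitle">
        self._href = None        # href of the <a> currently being read
        self._text = []          # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "sTitle":
            self._in_title = True
        elif tag == "a" and self._in_title:
            self._href = attrs.get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None
        elif tag == "div":
            self._in_title = False


# Sample input trimmed from the markup in the question (an assumption,
# standing in for the real page source string)
page_source = """
<td class="searchResultI">
  <div class="date">13:07 &nbsp;&nbsp; 17 July</div>
  <div class="sTitle">
    <a href="www.example1.com/result1">
        Link Description</a></div>
  <div class="sSubTitle"></div>
</td>
<td class="searchResultAI">
  <div class="sTitle">
    <a href="www.example2.com/result2">
        Link Description</a></div>
</td>
"""

parser = TitleLinkParser()
parser.feed(page_source)
parser.close()
links = parser.links   # list of (href, description) tuples
print(links)
```

With Beautiful Soup the same selection could probably be written more briefly as soup.select("div.sTitle a"), reading link.get('href') and link.get_text(strip=True) for each match.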
sgp