Analyze and grab links from an HTML page

Question

I'm new to Python and I'm having trouble doing a simple thing.

I have an HTML page and I want to analyze it and grab some links inside a specific table.

In bash I'd use lynx --source and with grep/cut I'd have no problem, but in Python I don't know how to do it.

I thought of doing something like this:

import urllib2

data = urllib2.urlopen("http://www.my_url.com")

Doing this, I get the whole HTML page.

Then I thought of doing:

for line in data.read():
    if "my_links" in line:
        print line

But it doesn't seem to work.

Use `data.readlines()`; then you will at least get the HTML lines containing your links. – Anurag Uniyal, Dec 06 '11 at 18:26

pyfunc · Accepted Answer · 2011-12-06T18:36:36.307

As for the problem in your code: called without a size argument, `data.read()` returns the whole page as a single string, and iterating over a string goes character by character:

for line in data.read():

You could read it line by line instead:

line = data.readline()
while line:
    print line
    line = data.readline()
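
To see the difference without touching the network, here is a small sketch that uses a hard-coded string in place of the downloaded page. The `page` contents and the `my_links` marker are invented stand-ins, not taken from a real site.

```python
# Stand-in for the string returned by data.read(); the content is invented.
page = "<html>\n<a href='/my_links/1'>one</a>\n<p>other</p>\n</html>"

# Iterating over a string yields one character at a time,
# so a multi-character substring test can never succeed.
char_hits = [ch for ch in page if "my_links" in ch]

# Splitting into lines first gives the behaviour the question expected.
line_hits = [line for line in page.splitlines() if "my_links" in line]

print(len(char_hits))
print(line_hits)
```

The first list stays empty because no single character can contain the text `my_links`; the second holds only the line with the link.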

This part isn't strictly an answer to your question, but I suggest that you use BeautifulSoup.

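import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

url = "http://www.my_url.com"
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)  # the class is imported directly, so call it as-is

# findAll returns every <a> tag; find would return only the first one
all_links = soup.findAll('a')
# you can also look for a specific link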
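
For the original goal, links inside one specific table, a sketch along the following lines may help. It swaps BeautifulSoup for Python 3's standard-library `html.parser` so that it runs without extra installs. The `TableLinkParser` class, the table id `my_links`, and the sample HTML are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TableLinkParser(HTMLParser):
    """Collect href values of <a> tags inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0   # 0 means we are outside the target table
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            if self.depth > 0:
                self.depth += 1          # nested table inside the target
            elif attrs.get("id") == self.table_id:
                self.depth = 1           # entering the target table
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1


# Invented sample page; only the table with id "my_links" should be used.
sample = """
<a href="/outside">outside</a>
<table id="my_links">
  <tr><td><a href="/first">first</a></td></tr>
  <tr><td><a href="/second">second</a></td></tr>
</table>
<table id="other"><tr><td><a href="/ignored">x</a></td></tr></table>
"""

parser = TableLinkParser("my_links")
parser.feed(sample)
print(parser.links)
```

With a real page, the string passed to `feed()` would be the decoded response body rather than the inline sample.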

edited Dec 06 '11 at 18:36

answered Dec 06 '11 at 18:24

pyfunc


+1 for BeautifulSoup. It is often useful to read this answer http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 when trying to extract information from HTML. – Mikko Ohtamaa Dec 06 '11 at 18:41

score 0 · Answer 2 · answered Apr 26 '14 at 08:07

Why don't you simply use enumerate():

import urllib2

site = urllib2.urlopen(r'http://www.rom.on.ca/en/join-us/jobs')

for i, j in enumerate(site):
    if "http://www.ontario.ca" in j:  # j is the current line
        print i + 1  # i starts at 0, but line numbering in the HTML source starts at 1, so add 1

>>620
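
The `+ 1` adjustment can be avoided by passing a start value to `enumerate`. A minimal sketch, using an in-memory `io.StringIO` object with invented content in place of the real HTTP response so that it runs offline:

```python
import io

# Stand-in for the file-like object returned by urlopen(); content is invented.
site = io.StringIO(
    "<html>\n"
    "<a href='http://www.example.com'>a</a>\n"
    "<a href='http://www.ontario.ca'>b</a>\n"
    "</html>\n"
)

# enumerate(iterable, 1) starts counting at 1, matching editor line numbers.
matches = [number for number, line in enumerate(site, 1)
           if "http://www.ontario.ca" in line]

print(matches)
```

Note that this only reports which lines mention the URL; it does not extract the link itself.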

score 0 · Answer 3 · answered Dec 06 '11 at 18:46

In the general case, you need XPath for this purpose. Examples: http://www.w3schools.com/xpath/xpath_examples.asp

Python has a beautiful library called lxml: http://lxml.de/xpathxslt.html
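
As a rough illustration of the XPath idea: lxml is a third-party package, so this sketch substitutes the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset and requires well-formed markup. The sample document and the table id `my_links` are assumptions.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample; real-world HTML often is not well-formed.
sample = """
<html><body>
  <table id="other"><tr><td><a href="/ignored">x</a></td></tr></table>
  <table id="my_links">
    <tr><td><a href="/first">first</a></td></tr>
    <tr><td><a href="/second">second</a></td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(sample)

# XPath-style query: every <a> anywhere below the table whose id is "my_links".
anchors = root.findall(".//table[@id='my_links']//a")
links = [a.get("href") for a in anchors]

print(links)
```

For pages that are not well-formed XML, a tolerant HTML parser such as the one lxml provides would be needed instead.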

Andrey Gubarev
