2

I'm new to Python and I'm having trouble doing a simple thing.

I have an HTML page and I want to parse it and grab some links inside a specific table.

In bash I'd use lynx --source and with grep/cut I'd have no problem, but in Python I don't know how to do it.

I thought of doing something like this:

import urllib2

data = urllib2.urlopen("http://www.my_url.com")

Doing that, I get the whole HTML page.

Then I thought I could do:

for line in data.read():
    if "my_links" in line:
        print line

But it doesn't seem to work.

gaggina

3 Answers

1

About the issue in your code: data.read() returns the whole page as a single string, so this loop iterates over it character by character rather than line by line:

for line in data.read():

You could do this instead:

line = data.readline()
while(line):
    print line
    line = data.readline()
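
To see why the original loop never prints anything, here is a small sketch written for Python 3 (where print is a function). It uses an in-memory io.StringIO holding made-up HTML as a stand-in for the object returned by urlopen, so nothing is fetched; the page contents and the helper name matching_lines are assumptions for illustration only.

```python
import io

# Made-up page content standing in for the real response body (assumption).
page = '<table>\n<a href="/my_links/1">one</a>\n<a href="/other">two</a>\n</table>\n'

# Iterating over a string yields single characters, so a multi-character
# substring test can never succeed.
chars = [c for c in page if "my_links" in c]

def matching_lines(fileobj, needle):
    """Return the lines of a file-like object that contain needle (hypothetical helper)."""
    return [line.rstrip("\n") for line in fileobj if needle in line]

# Iterating over the file-like object itself yields whole lines.
lines = matching_lines(io.StringIO(page), "my_links")

print(chars)
print(lines)
```

With a real response in Python 3 the lines come back as bytes, so they would need decoding before a comparison against a str.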

This part is not exactly an answer to your question, but I suggest that you use BeautifulSoup.

import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.my_url.com"
data = urllib2.urlopen(url).read()
soup = BeautifulSoup(data)

all_links = soup.findAll('a')
# you can look for specific link
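
For the "links inside a specific table" part of the question, BeautifulSoup needs far less code than what follows. The sketch below swaps in the standard library's html.parser instead, only so that it runs on Python 3 without installing anything. The class name TableLinkParser, the table id "my_links" and the sample markup are all assumptions for illustration.

```python
from html.parser import HTMLParser

class TableLinkParser(HTMLParser):
    """Collect href values of <a> tags inside the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.depth = 0   # greater than 0 while inside the target table
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            # Enter the target table, or count a table nested inside it.
            if self.depth or attrs.get("id") == self.table_id:
                self.depth += 1
        elif tag == "a" and self.depth and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "table" and self.depth:
            self.depth -= 1

# Made-up markup (assumption); a real page would come from urlopen(...).read().
sample = """
<a href="/outside">skip me</a>
<table id="my_links">
  <tr><td><a href="/first">first</a></td></tr>
  <tr><td><a href="/second">second</a></td></tr>
</table>
<table id="other"><tr><td><a href="/third">third</a></td></tr></table>
"""

parser = TableLinkParser("my_links")
parser.feed(sample)
print(parser.links)
```

Only the two links inside the table with id "my_links" are collected; the link before the table and the one in the other table are ignored.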
pyfunc
    +1 for BeautifulSoup. It is often useful to read this answer http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 when trying to extract information from HTML. – Mikko Ohtamaa Dec 06 '11 at 18:41
0

Why don't you simply use enumerate():

import urllib2

site = urllib2.urlopen(r'http://www.rom.on.ca/en/join-us/jobs')

for i, j in enumerate(site):
    if "http://www.ontario.ca" in j:  # j is the current line
        print i + 1  # i starts at 0, but the first line of the HTML source is line 1, so add 1

>>620 
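
The same idea can be written without the manual +1 by passing start=1 to enumerate. This is a minimal Python 3 sketch; the io.StringIO object and its contents are a made-up stand-in for the response, so the line number it reports is not the 620 from the real page.

```python
import io

# Made-up stand-in for the response object (assumption); nothing is fetched.
site = io.StringIO(
    "<html>\n"
    "<body>\n"
    "<a href='http://www.ontario.ca'>x</a>\n"
    "</body>\n"
)

# start=1 makes enumerate count from 1, matching how editors number lines.
hits = [number for number, line in enumerate(site, start=1)
        if "http://www.ontario.ca" in line]
print(hits)
```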
G M
0

In the general case, you need XPath for this purpose. Examples: http://www.w3schools.com/xpath/xpath_examples.asp

Python has a beautiful library called lxml: http://lxml.de/xpathxslt.html
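
To give a feel for what such a query looks like, here is a minimal Python 3 sketch. It does not use lxml: it swaps in the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath and requires well-formed markup, so real-world HTML would usually need lxml's HTML parser instead. The table id "my_links" and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample document (assumption).
sample = """<html><body>
<table id="my_links">
  <tr><td><a href="/first">first</a></td></tr>
  <tr><td><a href="/second">second</a></td></tr>
</table>
<table id="other"><tr><td><a href="/third">third</a></td></tr></table>
</body></html>"""

root = ET.fromstring(sample)

# Select every <a> element nested anywhere under the table with id "my_links".
links = [a.get("href") for a in root.findall(".//table[@id='my_links']//a")]
print(links)
```

The path expression reads left to right: find a table with the given id at any depth, then any a element below it.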

Andrey Gubarev