
My code takes a link passed on the command line, gets the HTML for the webpage at that link, searches that HTML for links to other pages, and then repeats these steps for each link it finds. I hope that is clear.

It should print out any links that cause errors.

Some more needed info:

It can make at most 100 visits. If a page has an error, a None value is returned instead of the HTML.

I am using Python 3.

e.g.:

    s = readwebpage(url)  # Gets the HTML code for the link (url) passed as its argument; if the link has an error, s is None.
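
readwebpage itself isn't shown here, but you can assume it behaves roughly like this sketch (the use of urllib.request and the exact error handling are just illustrative, not my actual implementation):

    import urllib.request
    from urllib.error import URLError

    def readwebpage(url):
        """Return the page's HTML as a string, or None if the link has an error."""
        try:
            with urllib.request.urlopen(url) as response:
                return response.read().decode('utf-8', errors='replace')
        except (URLError, ValueError):
            return None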

The HTML for that website contains links ending in p2.html, p3.html, p4.html, and p5.html. My code reads all of these, but it does not visit each of those links in turn to search for more links. If it did, it would find a link ending in p10.html on one of those pages, and it should then report that the link ending in p10.html has an error. Obviously it doesn't do that at the moment, and that's what is giving me a hard time.

My code:

    url = args.url[0]
    url_list = [url]
    checkedURLs = []
    AmountVisited = 0
    while (url_list and AmountVisited<maxhits):
        url = url_list.pop()
        s = readwebpage(url)
        print("testing url: http",url)                  #Print the url being tested, this code is here only for testing..
        AmountVisited = AmountVisited + 1
        if s == None:
            print("* bad reference to http", url)
        else:
            urls_list = re.findall(r'href="http([\s:]?[^\'" >]+)', s) #Creates a list of all links in HTML code starting with...
            while urls_list:                                          #... http or https
                insert = urls_list.pop()            
                while(insert in checkedURLs and urls_list):
                    insert = urls_list.pop()
                url_list.append(insert)
                checkedURLs = insert 

Please help :)

Shawn
  • Hi Shawn, why don’t you take a look at http://stackoverflow.com/tour first :) – xrisk Jun 28 '15 at 04:14
  • I did Rishav, I can't seem to understand why my code doesn't search links found in the HTML... – Shawn Jun 28 '15 at 04:22
  • Shawn, not trying to be rude, but your question looks like a mess, and unless you clean it up, nobody will _want_ to help you. Use the formatting tools. Code should be inside backticks, like `this is code`. All your code should be _here_ and not on OneDrive. Clean up your question and I’ll help you. – xrisk Jun 28 '15 at 04:24
  • You are attempting to parse HTML with regexes. That is an unpardonable sin. – xrisk Jun 28 '15 at 04:29
  • http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – xrisk Jun 28 '15 at 04:31
  • Shawn you should take a look into something called BeautifulSoup. It is a Python module designed **specifically** for parsing HTML. – xrisk Jun 28 '15 at 04:35
  • Also how do we know that your `readwebpage` works? – xrisk Jun 28 '15 at 04:37
  • BeautifulSoup is not permitted, unfortunately – Shawn Jun 28 '15 at 04:37
  • Hm. You should mention that in your question. And I’m afraid I don’t know regexes, so I can’t help you there. But are you sure your readwebpage is working? – xrisk Jun 28 '15 at 04:39
  • You can safely assume it does – Shawn Jun 28 '15 at 04:40
  • Why is there a link to OneDrive? – salmanwahed Jun 28 '15 at 04:41
  • @Shawn specifically mention that you have to use regexes, otherwise people will down vote this. – xrisk Jun 28 '15 at 04:44
  • @Shawn your regex is wrong. Running it against google’s page gave me links like `://www.google.co.in/imghp?hl=en&tab=wi` – xrisk Jun 28 '15 at 04:46
  • I don't have to use regex; any way I can figure out is acceptable. – Shawn Jun 28 '15 at 04:48

3 Answers


Here is the code you wanted. However, please stop using regexes to parse HTML; BeautifulSoup is the way to go for that.

import re
from urllib.request import urlopen
from urllib.error import URLError

def readwebpage(url):
    print("testing", url)
    try:
        return urlopen(url).read().decode('utf-8', errors='replace')
    except URLError:
        return None  # treat pages that can't be fetched as errors

url = 'http://xrisk.esy.es' # put starting url here

yet_to_visit = [url]
visited_urls = []

AmountVisited = 0
maxhits = 10

while yet_to_visit and AmountVisited < maxhits:

    print(yet_to_visit)
    current = yet_to_visit.pop()
    AmountVisited = AmountVisited + 1
    html = readwebpage(current)

    if html is None:
        print("* bad reference to", current)
    else:
        r = re.compile('(?<=href=").*?(?=")')
        links = re.findall(r, html)    # every href="..." value in the HTML
        for u in links:
            if u in visited_urls:
                continue
            elif u.find('http') != -1: # only queue absolute http/https links
                yet_to_visit.append(u)
        print(links)
    visited_urls.append(current)
xrisk
  • Tried it out, it doesn't find the link ending in p10.html, unfortunately. I suppose that could be due to a difference in our readwebpage; I'm not sure. Other than BeautifulSoup and regex, could you suggest another way I can find the links in the HTML? I appreciate your help – Shawn Jun 28 '15 at 07:02
  • It should print the URL whenever it goes to a page – xrisk Jun 28 '15 at 07:13
  • @Shawn if you’re there, then the output produced by my code will help me debug it :D – xrisk Jun 28 '15 at 12:29

  1. I suspect your regex is part of your problem. Right now, you have http outside your capture group, and [\s:] matches "some sort of whitespace (i.e. \s) or :".

I'd change the regex to `urls_list = re.findall(r'href="(.*)"', s)`, also known as "match anything in quotes, after href=". If you absolutely need to ensure the http[s]://, use `r'href="(https?://.*)"'` (`s?` means one or zero `s`).

EDIT: And with an actually working regex, using a non-greedy match: `r'href=(?P<q>[\'"])(https?://.*?)(?P=q)'`

(Also, uh, while it's not technically necessary in your case because `re` caches compiled patterns, I think it's good practice to get into the habit of using `re.compile`; see the short sketch after this list.)

  2. I think it's awfully nice that all of your URLs are full URLs. Do you have to deal with relative URLs at all?
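
For what it's worth, here is a minimal sketch of how that compiled pattern could be used (the find_links name and the sample HTML are made up purely for illustration):

    import re

    # Compiled once, as suggested above: capture the URL between matching quotes.
    LINK_RE = re.compile(r'href=(?P<q>[\'"])(https?://.*?)(?P=q)')

    def find_links(html):
        """Return all absolute http/https links found in href attributes."""
        return [match[1] for match in LINK_RE.findall(html)]

    sample = '<a href="http://example.com/p2.html">2</a> <a href=\'http://example.com/p3.html\'>3</a>'
    print(find_links(sample))  # ['http://example.com/p2.html', 'http://example.com/p3.html']
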
NightShadeQueen

Not Python, but since you mentioned you aren't tied strictly to regex, I think you might find wget useful for this.

    wget --spider -o C:\wget.log -e robots=off -w 1 -r -l 10 http://www.stackoverflow.com

Broken down:

--spider: When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
-o C:\wget.log: Log all messages to C:\wget.log.
-e robots=off: Ignore robots.txt
-w 1: set a wait time of 1 second
-r: set recursive search on
-l 10: sets the recursion depth to 10, meaning wget will only go 10 levels deep; this may need to change depending on your maximum number of requests
http://www.stackoverflow.com: the URL you want to start with

Once it completes, you can review the wget.log entries to determine which links had errors by searching for HTTP status codes such as 404.
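
If it helps, here is a rough Python sketch of that last step; the exact wording wget writes for failed links varies between versions, so the markers it searches for are assumptions you may need to adjust:

    # Scan the wget --spider log for lines that look like failed links.
    error_markers = ("404", "broken link", "ERROR")

    with open(r"C:\wget.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            if any(marker in line for marker in error_markers):
                print(line.rstrip())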

serk