
I'm trying to understand BeautifulSoup: I want to find all the links within facebook.com and then iterate over every link found within each of them...

Here is my code... it works fine, but once it finds linkedin.com and iterates over it, it gets stuck at a point after this URL - http://www.linkedin.com/redir/redirect?url=http%3A%2F%2Fbusiness%2Elinkedin%2Ecom%2Ftalent-solutions%3Fsrc%3Dli-footer&urlhash=f9Nj

When I run Linkedin.com separately, I don't have any problem...

Could this be a limitation of my operating system? I'm using Ubuntu Linux...

import urllib2
import BeautifulSoup
import re
def main_process(response):
    print "Main process started"
    soup = BeautifulSoup.BeautifulSoup(response)
    limit = '5'
    count = 0
    main_link = valid_link =  re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$","http://www.facebook.com")
    if main_link:
        main_link = main_link.group(1)
    print 'main_link = ', main_link
    result = {}
    result[main_link] = {'incoming':[],'outgoing':[]}
    print 'result = ', result
    for link in soup.findAll('a',href=True):
        if count < 10:
            valid_link =  re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$",link.get('href'))
            if valid_link:
                #print 'Main link = ', link.get('href')
                print 'Links object = ', valid_link.group(1)
                connecting_link = valid_link.group(1)
                connecting_link = connecting_link.encode('ascii')
                if main_link <> connecting_link:
                    print 'outgoing link = ', connecting_link
                    result = add_new_link(connecting_link, result)
                    #Check if the outgoing is already added, if its then don't add it
                    populate_result(result,main_link,connecting_link)
                    print 'result = ', result
                    print 'connecting'
                    request = urllib2.Request(connecting_link)
                    response = urllib2.urlopen(request)
                    soup = BeautifulSoup.BeautifulSoup(response)
                    for sublink in soup.findAll('a',href=True):
                        print 'sublink = ', sublink.get('href')
                        valid_link =  re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$",sublink.get('href'))
                        if valid_link:
                            print 'valid_link = ', valid_link.group(1)
                            valid_link = valid_link.group(1)
                            if valid_link <> connecting_link:
                                populate_result(result,connecting_link,valid_link)
        count += 1      
    print 'final result = ', result
    #    print 'found a url with national-park in the link'

def add_new_link(connecting_link, result):
    result[connecting_link] = {'incoming':[],'outgoing':[]}
    return result

def populate_result(result,link,dest_link):

    if len(result[link]['outgoing']) == 0:
        result[link]['outgoing'].append(dest_link)
    else:
        found_in_list = 'Y'
        try:
            result[link]['outgoing'].index(dest_link)
            found_in_list = 'Y'
        except ValueError:
            found_in_list = 'N'

        if found_in_list == 'N':
            result[link]['outgoing'].append(dest_link)

    return result


if __name__ == "__main__":

    request = urllib2.Request("http://facebook.com")
    print 'process start'
    try:
        response = urllib2.urlopen(request)
        main_process(response)
    except urllib2.URLError, e:
        print "URLERROR"

    print "program ended"

1 Answer


The problem is that re.search() hangs on certain URLs, on this line:

valid_link = re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", sublink.get('href'))

For example, it hangs on the URL https://www.facebook.com/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto:

>>> import re
>>> s = "https://www.facebook.com/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto"
>>> re.search("^(https?://(?:\w+.)+\.com)(?:/.*)?$", s)

It hangs "forever"...

It looks like the pattern introduces a catastrophic backtracking case: the unescaped dot in (?:\w+.)+ also matches word characters, so the group can split the host and path into an enormous number of alternative ways, and the engine works through them while trying to place the final \.com.

One solution would be to use a different regex for validating the URL; there are plenty of alternatives to choose from.
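
For instance, here is a minimal sketch that keeps the spirit of the original pattern (it still only accepts .com hosts): escaping the dot inside the repeated group means \w+ and \. can no longer match the same characters, so there is only one way to split the host name and the search returns immediately:

>>> import re
>>> safe = re.compile(r"^(https?://(?:\w+\.)+com)(?:/.*)?$")  # \. instead of .
>>> s = "https://www.facebook.com/campaign/landing.php?placement=pflo&campaign_id=402047449186&extra_1=auto"
>>> safe.search(s).group(1)
'https://www.facebook.com'

Or skip the regex altogether and let the standard library's urlparse pull out the host:

>>> from urlparse import urlparse
>>> urlparse(s).netloc
'www.facebook.com'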

Hope that helps.

  • FYI, your regular expression case actually forced me to start a new thread, see http://stackoverflow.com/questions/22650098/very-slow-regular-expression-search – alecxe Mar 26 '14 at 03:01