

I have been writing a web crawler in Python 3, and everything else seems to be working.
I decided to use urllib to get the source code of the pages I am going to crawl,
but I get a NameError that says:

    name 'urllib' is not defined

Here is my Python code:

def get_url(url):
    from urllib.request import urlopen
    source = urllib.request.urlopen(url)
    return source

def getNextTarget(page):
    startLink = page.find("<a href=")
    if startLink == -1:
        return None, 0
    startQuote = page.find('"', startLink)
    endQuote = page.find('"', startQuote + 1)
    url = page[startQuote + 1 : endQuote]
    return url, endQuote

def findAllLinks(page):
    while True:
        url, endpos = getNextTarget(page)
        if url:
            print(url)
            page = page[endpos:]
        else:
            break

findAllLinks(get_url("https://xkcd.com/"))

Sorry if this question has already been asked.
Thank you in advance for your help.
P.S.: The main problem is with the get_url() function.

James Deal

1 Answer


Your get_url function returns a response object (an HTTPResponse), not a string, so you cannot call page.find() on it in getNextTarget. Call .read() on the response to get the page contents as bytes, then .decode() the bytes to get a string.
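Something along these lines should work. Treat it as a rough sketch rather than a drop-in fix. Note that `from urllib.request import urlopen` binds only the name `urlopen`, so `urllib.request.urlopen(...)` raises the NameError; calling `urlopen(url)` directly avoids it. The UTF-8 fallback and the `errors="replace"` choice are my assumptions, and I changed findAllLinks to return a list instead of printing so the result is easier to check.

```python
from urllib.request import urlopen


def get_url(url):
    """Fetch a page and return its body as a str."""
    # urlopen was imported by name, so call it directly; the name
    # `urllib` is never bound by a `from ... import` statement.
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # Use the charset the response declares; UTF-8 is an assumed fallback.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def getNextTarget(page):
    """Return the first double-quoted href value and the index of its closing quote."""
    startLink = page.find("<a href=")
    if startLink == -1:
        return None, 0
    startQuote = page.find('"', startLink)
    endQuote = page.find('"', startQuote + 1)
    return page[startQuote + 1 : endQuote], endQuote


def findAllLinks(page):
    """Collect every href found by getNextTarget (returns a list instead of printing)."""
    links = []
    while True:
        url, endpos = getNextTarget(page)
        if not url:
            break
        links.append(url)
        page = page[endpos:]
    return links
```

With that in place, `print(findAllLinks(get_url("https://xkcd.com/")))` should print the links. For anything beyond a quick experiment, an HTML parser such as the standard library's html.parser is likely more robust than searching for `<a href=` by hand.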

See also:

AttributeError: 'HTTPResponse' object has no attribute 'split'
https://docs.python.org/3/library/urllib.request.html

Viral Modi