
I'm building a web crawler. Some of the data I put into the datastore gets saved, but other data does not, and I have no idea what the problem is.

Here is my crawler class:

import re
import datetime
import urllib2

class Crawler(object):

    def get_page(self, url):
        try:
            # with a User-Agent header set, I am able to download pages
            # (I previously tried urlfetch.fetch(url, method='GET') here,
            # returning iu.message on urlfetch.InvalidURLError)
            req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
            response = urllib2.urlopen(req)
            return response.read()

        except urllib2.HTTPError as e:
            return e.reason


    def get_all_links(self, page):
        # pull out everything in the raw page text that looks like a URL
        return re.findall('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', page)


    def union(self, lyst1, lyst2):
        # append to lyst1 every element of lyst2 that it doesn't already contain
        for elmt in lyst2:
            if elmt not in lyst1:
                lyst1.append(elmt)
        return lyst1

    # crawls the web for links, starting from the seed;
    # returns True on success, False on failure
    def crawl_web(self, seed="http://tonaton.com/"):
        objListing = Listings.query().get()  # fetch the stored Listings entity, if any
        if not objListing:
            objListing = Listings()
            objListing.toCrawl = [seed]
            objListing.Crawled = []

        start_time = datetime.datetime.now()
        # crawl for at most 5 seconds, since exhausting toCrawl can take forever
        while datetime.datetime.now() - start_time < datetime.timedelta(0, 5):
            try:
                page = objListing.toCrawl.pop()

                if page not in objListing.Crawled:
                    content = self.get_page(page)
                    add_page_to_index(page, content)
                    outlinks = self.get_all_links(content)
                    graph = Graph()  # create a graph entity for this url
                    graph.url = page
                    graph.links = outlinks  # save all outlinks as the value part of the graph url
                    graph.put()

                    self.union(objListing.toCrawl, outlinks)
                    objListing.Crawled.append(page)
            except:
                return False

        objListing.put()  # save to the datastore
        return True

The classes that define the various ndb models are in this Python module:

import os
import urllib
from google.appengine.ext import ndb
import webapp2

class Listings(ndb.Model):
    toCrawl = ndb.StringProperty(repeated=True)
    Crawled = ndb.StringProperty(repeated=True)

#let's see how this works

class Index(ndb.Model):
    keyword = ndb.StringProperty() # keyword part of the index
    url = ndb.StringProperty(repeated=True) # value part of the index

#class Links(ndb.Model):
#    links = ndb.JsonProperty(indexed=True)

class Graph(ndb.Model):
    url = ndb.StringProperty()
    links = ndb.StringProperty(repeated=True)

It used to work fine when I had JsonProperty in place of StringProperty(repeated=True), but JsonProperty is limited to 1500 bytes, so I hit an error at some point.

Now, when I run the crawl_web member function, it actually crawls, but when I check the datastore only the Index entities are created: no Graph, no Listings. Please help. Thanks.

  • You can always temporarily add some `logging.debug()` calls right after your `.put()` calls, at least to take out the guesswork - is execution even getting there? And I'd suggest a permanent `logging.error()` or `logging.exception()` message in your `except:` statement. – Dan Cornilescu Nov 26 '15 at 13:14
  • ok thanks for the suggestion. I'll try that right away – blitzblade Nov 26 '15 at 13:28
  • Notice your try/bare except in `crawl_web`. Any error inside it will effectively fail silently and you won't have anything saved. – Tim Hoffman Nov 27 '15 at 06:41
  • actually I've changed it to `except Exception, e: logging.exception(e); return False` @TimHoffman. Also I added these lines beneath all `.put()` calls: `logging.basicConfig(filename='datastore.log', level=logging.DEBUG); logging.debug('the graph was put')` as an attempt to log, but no log file was created. I don't know if that's the right way to log, though @DanCornilescu – blitzblade Nov 27 '15 at 09:14
  • it's not the right way. You don't have to name the file. See the docs. Just import it and use it: https://cloud.google.com/appengine/docs/python/requests#Python_Logging For dev_appserver look here: http://stackoverflow.com/questions/2844635/where-does-googleappenginelauncher-keep-the-local-log-files – Paul Collingwood Nov 27 '15 at 11:09
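For reference, a minimal sketch of the logging approach suggested in these comments (on App Engine there is no need for `logging.basicConfig` or a log filename; the runtime collects `logging` output automatically; the `page` and `graph` names are taken from the question's code):

import logging

try:
    graph.put()
    logging.debug('Graph entity saved for %s', page)  # confirms execution reached the put()
except Exception:
    logging.exception('failed while crawling %s', page)  # logs the full traceback
    return False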

1 Answer


Putting your code together, adding the missing imports, and logging the exception eventually reveals the first killer problem:

Exception Indexed value links must be at most 500 characters

and indeed, adding a logging of outlinks, one easily eyeballs that several of them are far longer than 500 characters -- therefore they can't be items in an indexed property, such as a StringProperty. Changing each repeated StringProperty to a repeated TextProperty (so it does not get indexed and thus has no 500-characters-per-item limitation; see the sketch after the traceback below), the code runs for a while (making a few instances of Graph) but eventually dies with:

An error occured while connecting to the server: Unable to fetch URL: https://sb':'http://b')+'.scorecardresearch.com/beacon.js';document.getElementsByTagName('head')[0].appendChild(s); Error: [Errno 8] nodename nor servname provided, or not known

and indeed, it's pretty obvious that the alleged "link" is actually a bunch of Javascript and as such cannot be fetched.
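Here is that TextProperty change as a minimal sketch, using the model names from the question (each repeated StringProperty becomes an unindexed TextProperty):

from google.appengine.ext import ndb

class Listings(ndb.Model):
    # TextProperty is not indexed, so items are not subject to the
    # per-item length limit on indexed string properties
    toCrawl = ndb.TextProperty(repeated=True)
    Crawled = ndb.TextProperty(repeated=True)

class Graph(ndb.Model):
    url = ndb.StringProperty()  # single URLs are short enough to stay indexed
    links = ndb.TextProperty(repeated=True)  # outlinks can be arbitrarily long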

So, essentially, the core bug in your code is not at all related to app engine, but rather, the issue is that your regular expression:

'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

does not properly extract outgoing links given a web page containing Javascript as well as HTML.

There are many issues with your code, but to this point they're just slowing it down or making it harder to understand, not killing it -- what's killing it is using that regular expression pattern to try and extract links from the page.

Check out retrieve links from web page using python and BeautifulSoup -- most answers suggest, for the purpose of extracting links from a page, using BeautifulSoup, which may perhaps be a problem in app engine, but one shows how to do it with just Python and REs.
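For illustration, a minimal sketch of the BeautifulSoup approach (this assumes the bs4 package is bundled with the app, as the comments below describe; it is one way to do it, not the only one):

from bs4 import BeautifulSoup

def get_all_links(page):
    # parse the HTML and take href values from anchor tags, instead of
    # regex-matching anything URL-shaped anywhere in the page (including Javascript)
    soup = BeautifulSoup(page, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if a['href'].startswith(('http://', 'https://'))]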

  • man... you're a genius! Haven't tried all that yet, but it really makes a whole lot of sense. First things first, I'll learn how to log properly so as to debug my code more easily... thanks man – blitzblade Nov 29 '15 at 08:18
  • I'm now using BeautifulSoup4. Seems cool, and it actually works fine when you copy the whole module into the App Engine project. After trying to debug and get relevant info by logging (I changed links to TextProperty too), I realised the `add_page_to_index` function stops executing and hangs. Nothing executes after that. Could it be that it is also limited in some way? – blitzblade Nov 29 '15 at 12:35
  • @blitzblade, glad bs4 is helping. As for `add_page_to_index`, since you don't show its code, it's of course impossible for anybody to offer any help. On the principle of "just one question per question" (this one already had two issues: link extraction and the 500-char limit for indexed properties), I'd recommend you accept my answer (click on the checkmark outline to its left) since it's been helpful, and open another Q focusing in `add_page_to_index`, giving its code and the *minimum* amount of other code needed to reproduce your problem. – Alex Martelli Nov 29 '15 at 18:02