I'm adapting the naive Bayes classifier from the excellent book Programming Collective Intelligence to the GAE datastore (the book's code uses pysqlite2). While doing so, I hit an error that I find difficult to debug:

  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 216, in post
    sampletrain(nb)
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 201, in sampletrain
    cl.train('Nobody owns the water.','good')
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 139, in train
    self.incf(f,cat)
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 71, in incf
    count=self.fcount(f,cat)
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 92, in fcount
    return float(res)
TypeError: float() argument must be a string or a number

The error is in this block:

  def fcount(self,f,cat):
    res = db.GqlQuery("SELECT * FROM fc WHERE feature =:feature AND category =:category", feature = f, category = cat).get()
#    res=self.con.execute(
#      'select count from fc where feature="%s" and category="%s"'
#      %(f,cat)).fetchone()
    if res is None: return 0
    else:
        res = fc.count
        return float(res)
#        return float(res[0])

If I put set_trace() at line 91, like this:

def fcount(self,f,cat):
    res = db.GqlQuery("SELECT * FROM fc WHERE feature =:feature AND category =:category", feature = f, category = cat).get()
    set_trace()
#    res=self.con.execute(
#      'select count from fc where feature="%s" and category="%s"'
#      %(f,cat)).fetchone()
    if res is None: return 0
    else:
        res = fc.count
        set_trace()
        return float(res)

I get this traceback:

  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 224, in post
    sampletrain(nb)
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 209, in sampletrain
    cl.train('Nobody owns the water.','good')
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 147, in train
    self.incf(f,cat)
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 77, in incf
    count=self.fcount(f,cat)
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 95, in fcount
    if res is None: return 0
  File "C:\Users\CG\Desktop\Google Drive\Sci&Tech\projects\naivebayes\main.py", line 95, in fcount
    if res is None: return 0
  File "C:\Python27\lib\bdb.py", line 48, in trace_dispatch
    return self.dispatch_line(frame)
  File "C:\Python27\lib\bdb.py", line 67, in dispatch_line
    if self.quitting: raise BdbQuit
BdbQuit

It seems to be related to the GqlQuery. I'd like to test the code in the Python IDE, printing variables and queries step by step, to figure out where the problem is. But when I try to run it in the IDE, I get error messages (like "ImportError: No module named webapp2"). I'm also not familiar enough with the program flow to change it successfully. I tried, but got lost (I'm a novice programmer and only recently started learning OOP). What is the best way to find the error in this case?

The expected answer should also identify the error itself.

Thanks in advance for any help!

Here is the entire code:

#!/usr/bin/env python
#
# Copyright 2007 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# -*- coding: utf-8 -*-

import os

import webapp2

import jinja2

from jinja2 import Environment, FileSystemLoader

jinja_environment = jinja2.Environment(autoescape=True,
    loader=jinja2.FileSystemLoader(os.path.join(os.path.dirname(__file__), 'templates')))

import random

from google.appengine.ext import db

import re

import math

def set_trace():
    import pdb, sys
    debugger = pdb.Pdb(stdin=sys.__stdin__,
        stdout=sys.__stdout__)
    debugger.set_trace(sys._getframe().f_back)

class fc(db.Model):
    feature = db.StringProperty(required = True)
    category = db.StringProperty(required = True)
    count = db.IntegerProperty(required = True)

class cc(db.Model):
    category = db.StringProperty(required = True)
    count = db.IntegerProperty(required = True)

def getfeatures(doc):
  splitter=re.compile('\\W*')
  # Split the words by non-alpha characters
  words=[s.lower() for s in splitter.split(doc)
          if len(s)>2 and len(s)<20]
  return dict([(w,1) for w in words])

class classifier:
  def __init__(self,getfeatures, filename=None):
    # Counts of feature/category combinations
    self.fc={}
    # Counts of documents in each category
    self.cc={}
    self.getfeatures=getfeatures

#  def setdb(self,dbfile):
#    self.con=sqlite.connect('db_file')
#    self.con=sqlite3.connect(":memory:")
#    self.con.execute('create table if not exists fc(feature,category,count)')
#    self.con.execute('create table if not exists cc(category,count)')

  def incf(self,f,cat):
    count=self.fcount(f,cat)
    if count==0:
      fc_value = fc(feature = f, category = cat, count = 1)
      fc_value.put()
    else:
        update = db.GqlQuery("SELECT count FROM fc where feature =:feature AND category =:category", feature = f, category = cat).get()
        update.count = count + 1
        update.put()
#      self.con.execute(
 #       "update fc set count=%d where feature='%s' and category='%s'"
  #      % (count+1,f,cat))

  def fcount(self,f,cat):
    res = db.GqlQuery("SELECT * FROM fc WHERE feature =:feature AND category =:category", feature = f, category = cat).get()
#    res=self.con.execute(
#      'select count from fc where feature="%s" and category="%s"'
#      %(f,cat)).fetchone()
    if res is None: return 0
    else:
        res = fc.count
        return float(res)
#        return float(res[0])

  def incc(self,cat):
    count=self.catcount(cat)
    if count==0:
      #  self.con.execute("insert into cc values ('%s',1)" % (cat))
      cc_value = cc(category = cat, count = 1)
      cc_value.put()
    else:
        update = db.GqlQuery("SELECT count FROM cc where category =:category", category = cat).get()
        update.count = count + 1
        update.put()
#      self.con.execute("update cc set count=%d where category='%s'"
#                       % (count+1,cat))

  def catcount(self,cat):
#    res=self.con.execute('select count from cc where category="%s"'
 #                        %(cat)).fetchone()
    res = db.GqlQuery("SELECT count FROM cc WHERE category =:category", category = cat).get()
    if res is None: return 0
#    else: return float(res[0])
    else: return float(res)

  def categories(self):
#    cur = self.con.execute('select category from cc');
    cur = db.GqlQuery("SELECT category FROM cc").fetch(999)
    return [d[0] for d in cur]

  def totalcount(self):
   # res=self.con.execute('select sum(count) from cc').fetchone();
    all_cc = db.GqlQuery("SELECT * FROM cc").fetch(999)
    res = 0
    for cc in all_cc:
        count = cc.count
        res+=count
#    res = db.GqlQuery("SELECT sum(count) FROM cc").get()
#    if res==None: return 0
    if res == 0: return 0
#    return res[0]
    return res

  def train(self,item,cat):
    features=self.getfeatures(item)
    # Increment the count for every feature with this category
    for f in features.keys():
##    for f in features:
      self.incf(f,cat)
    # Increment the count for this category
    self.incc(cat)
#    self.con.commit()

  def fprob(self,f,cat):
    if self.catcount(cat)==0: return 0
    # The total number of times this feature appeared in this
    # category divided by the total number of items in this category
    return self.fcount(f,cat)/self.catcount(cat)

  def weightedprob(self,f,cat,prf,weight=1.0,ap=0.5):
    # Calculate current probability
    basicprob=prf(f,cat)
    # Count the number of times this feature has appeared in
    # all categories
    totals=sum([self.fcount(f,c) for c in self.categories()])

    # Calculate the weighted average
    bp=((weight*ap)+(totals*basicprob))/(weight+totals)
    return bp

class naivebayes(classifier):
  def __init__(self,getfeatures):
    classifier.__init__(self, getfeatures)
    self.thresholds={}

  def docprob(self,item,cat):
    features=self.getfeatures(item)
    # Multiply the probabilities of all the features together
    p=1
    for f in features: p*=self.weightedprob(f,cat,self.fprob)
    return p

  def prob(self,item,cat):
    catprob=self.catcount(cat)/self.totalcount()
    docprob=self.docprob(item,cat)
    return docprob*catprob

  def setthreshold(self,cat,t):
    self.thresholds[cat]=t

  def getthreshold(self,cat):
    if cat not in self.thresholds: return 1.0
    return self.thresholds[cat]

  def classify(self,item,default=None):
    probs={}
    # Find the category with the highest probability
    max=0.0
    for cat in self.categories():
      probs[cat]=self.prob(item,cat)
      if probs[cat]>max:
        max=probs[cat]
        best=cat
    # Make sure the probability exceeds threshold*next best
    for cat in probs:
      if cat==best: continue
      if probs[cat]*self.getthreshold(best)>probs[best]: return default
    return best

def sampletrain(cl):
  cl.train('Nobody owns the water.','good')
  cl.train('the quick rabbit jumps fences','good')
  cl.train('buy pharmaceuticals now','bad')
  cl.train('make quick money at the online casino','bad')
  cl.train('the quick brown fox jumps','good')    

class MainHandler(webapp2.RequestHandler):
    def get(self):
        template_values = {"given_sentence":'put a name here'}
        template = jinja_environment.get_template('index.html')
        self.response.out.write(template.render(template_values))

    def post(self):
        nb = naivebayes(getfeatures)
        sampletrain(nb)
        given_sentence = self.request.get("given_sentence")
        spam_result = nb.classify(given_sentence)
        submit_button = self.request.get("submit_button")
        if submit_button:
            self.redirect('/test_result?spam_result=%s&given_sentence=%s' % (spam_result, given_sentence))

class test_resultHandler(webapp2.RequestHandler):
    def get(self):
        spam_result = self.request.get("spam_result")
        given_sentence = self.request.get("given_sentence")
        test_result_values = {"spam_result": spam_result,
                             "given_sentence": given_sentence}
        template = jinja_environment.get_template('test_result.html')
        self.response.out.write(template.render(test_result_values))

app = webapp2.WSGIApplication([('/', MainHandler), ('/test_result', test_resultHandler)],
                              debug=True)
craftApprentice
  • **Hint:** In your code, `fc` is a class; `fc.count` is an instance of `db.IntegerProperty`. `res` holds the results of your query. – icktoofay Aug 12 '12 at 01:30
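
To illustrate the hint, here is a minimal sketch that runs without App Engine. `IntegerProperty` and `FeatureCount` below are hypothetical stand-ins for `db.IntegerProperty` and the `fc` model, written on the assumption that datastore properties behave like ordinary Python descriptors: read on the class they give back the property object, read on an instance they give back the stored value.

```python
class IntegerProperty(object):
    """Hypothetical stand-in for db.IntegerProperty (a plain descriptor)."""

    def __init__(self, name):
        self.name = name

    def __get__(self, instance, owner):
        if instance is None:
            # Accessed on the class (like fc.count): return the property itself.
            return self
        # Accessed on an instance (like res.count): return the stored value.
        return instance.__dict__[self.name]

    def __set__(self, instance, value):
        instance.__dict__[self.name] = value


class FeatureCount(object):
    """Hypothetical stand-in for the fc model."""
    count = IntegerProperty('count')

    def __init__(self, count):
        self.count = count


def to_float(value):
    """Return float(value), or None if float() raises TypeError."""
    try:
        return float(value)
    except TypeError:
        return None


res = FeatureCount(count=3)                # plays the role of the query result
from_class = to_float(FeatureCount.count)  # mirrors float(fc.count)
from_entity = to_float(res.count)          # mirrors float(res.count)
```

If the real classes behave the same way, replacing `res = fc.count` with `res = res.count` in `fcount` should hand `float()` a number. `catcount` appears to have a related problem, since it passes the whole query result to `float()`.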

3 Answers


You've got a type error, which should be simple to find, but you seem to be making a false choice between running the code on a deployment server and running it in your IDE.

There is a GAE development server which you run locally; it simulates the deployment environment.

Throw away your IDE, run against the development server, and use print liberally to check that the values are what you expect, working outward from the source of the error.

An IDE is no substitute for understanding what the code is doing, and relying on one puts a layer of separation between you and your code, which only makes debugging more difficult.

msw
  • Thanks for your wise advice, @msw. Unfortunately, I'm a novice and I didn't understand how to use the `dev_appserver.py myapp` command. But I think it would solve my problem. – craftApprentice Aug 12 '12 at 04:29
  • Use `logging`, not `print`, in a webapp. Trying to print debugging info to stdout is a horrible idea. – Wooble Aug 12 '12 at 04:40
  • More great advice. I'm reading this now: https://developers.google.com/appengine/articles/logging . I didn't know anything about logging. @msw, at a glance, do you have any idea where the error in this code comes from? – craftApprentice Aug 12 '12 at 04:46
  • Or use an IDE that actually knows Python and will give you debugging under dev server: http://www.jetbrains.com/pycharm/quickstart/appengine_guide.html – Peter Knego Aug 12 '12 at 07:37
  • Hi, @Peter, I'm using PyScripter, and other PyScripter users seem to have a similar question: http://stackoverflow.com/questions/789558/how-to-debug-google-app-engine-scripts-with-pyscripter . I'll consider using PyCharm. – craftApprentice Aug 12 '12 at 10:58
  • Hi, @msw, I tried to use logging, but I'm stuck on this: http://stackoverflow.com/questions/11944206/how-to-see-logging-messages-for-debbuging-when-using-pythons-logging-module – craftApprentice Aug 14 '12 at 01:33
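
As a rough sketch of the logging suggestion, the helper below records a value's `repr` and type through Python's standard `logging` module instead of `print`. The name `log_value` is made up for this example, and whether messages at a given level show up in the dev server's console depends on how the server is configured.

```python
import logging


def log_value(name, value):
    """Log a variable's repr and type, and return the message that was logged."""
    message = "%s = %r (type %s)" % (name, value, type(value).__name__)
    logging.info(message)
    return message


# Example: what fcount would log when the query finds nothing.
message = log_value("res", None)
```

Calling `log_value('res', res)` right after the query in `fcount` would show whether `res` is `None`, a number, or a whole entity.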

Depending on the IDE, you may find it hard to use its integrated debugger. IDEs typically don't have the environment set up to support the App Engine runtime (hence the import error you got; you would get other errors too).

If you include this code at the top of the file you are interested in:

def set_trace():
    import pdb, sys
    debugger = pdb.Pdb(stdin=sys.__stdin__,
        stdout=sys.__stdout__)
    debugger.set_trace(sys._getframe().f_back)

and then put a set_trace() call anywhere in your code and run under the dev server, your code will be interrupted at that point and you can use the pdb debugger.

Tim Hoffman
  • Thanks, @Tim. I tried to use it, but in this case the error message doesn't seem much more enlightening than the previous one. Maybe I'm not using it correctly. – craftApprentice Aug 12 '12 at 04:40
  • It won't be; however, you can now directly interrogate the objects, play with them, etc. I am answering how to debug in App Engine using a debugger rather than your specific question. As to the specifics of the error, you are trying to turn an fc into a float(). Stick the `set_trace()` call before that line, and actually see what `res` is using `repr(res)` or something similar. – Tim Hoffman Aug 12 '12 at 05:08
  • Hi, @Tim Hoffman, could you tell me whether I used `set_trace()` correctly? I edited the question to add the `set_trace()` information. Thanks again! – craftApprentice Aug 12 '12 at 10:48
  • You will need to run the dev_server from the command line. I assume you are running it from the app launcher. You will need a console (cmdtool) to get to the interactive debugger. – Tim Hoffman Aug 12 '12 at 10:52

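From the code you included, I think the error is on this line:

 res = fc.count

Here `fc` is the model class, so `fc.count` is the `db.IntegerProperty` object itself rather than a stored number, which is why `float()` rejects it. You should try reading the value from the entity the query returned instead:

res = res.count

`count()` is a method of GqlQuery, but `res` is the entity returned by `.get()`, not the query, so its count is read as a property.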

Wei