I am trying to parse an HTML page in Python. When I reach a certain tag, I would like to start printing all the data. So far I have come up with this:

class MyHTMLParser(HTMLParser):
    start = False;
    counter = 0;
    def handle_starttag(self,tag,attrs):
        if(tag == 'TBODY'):
            start = True;
            counter +=1
            #if counter == 1
    def handle_data(self,data):
        if (start == True): # this is the error line
            print data

The problem is that I get an error saying that start is not defined. I know I could use global, but wouldn't that force me to define the variable outside the whole class?

EDIT: Changing start to self.start solves the problem, but is there a way to define it inside __init__ without messing up HTMLParser's own __init__?

Bartlomiej Lewandowski

3 Answers

class MyHTMLParser(HTMLParser):
    start = False;
    counter = 0;
    ...

This does not do what you think it does!

In Java, C#, or similar languages, what the analogous code does is declare that the class of objects known as MyHTMLParser all have an attribute start with the initial value of False, and counter with the initial value of 0.

In Python, classes are objects too. They have their own attributes, just like every other object. So what the above does in Python is create a class object named MyHTMLParser, with an attribute start set to False and an attribute counter set to 0. [1]
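A minimal sketch of that behaviour, written for Python 3 (the thread itself uses Python 2). The class name Parser is hypothetical and has nothing to do with HTMLParser; it only shows where the attributes live:

```python
class Parser:
    start = False    # attribute of the class object itself
    counter = 0      # likewise stored on the class, not on instances

a = Parser()
b = Parser()

# A fresh instance has no attributes of its own; reading a.start
# falls back to the attribute stored on the class.
print(vars(a))
print(a.start)

# Assigning through an instance creates an attribute on that instance only.
a.start = True
print(vars(a))
print(b.start, Parser.start)
```

Reading through an instance works because attribute lookup falls back to the class, which is why the mistake often goes unnoticed until something is assigned.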

Another thing to keep in mind is that there is no way whatsoever to make an assignment to a bare name like start = True set an attribute on an object. It always sets a variable named start. [2]

So your class contains no code that ever sets any attributes on any of your MyHTMLParser instances; the code in the class body is setting attributes on the class object itself, and the code in handle_starttag is setting local variables which are then discarded when they fall out of scope.
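As a rough illustration (Python 3, with a hypothetical class Demo and hypothetical method names), a bare-name assignment inside a method leaves both the instance and the class untouched, while self.flag = ... sets an instance attribute:

```python
class Demo:
    flag = False                 # class attribute

    def set_with_bare_name(self):
        flag = True              # local variable, discarded when the method returns
        return flag

    def set_with_self(self):
        self.flag = True         # attribute on this particular instance

d = Demo()

d.set_with_bare_name()
after_bare = d.flag              # still the class attribute's value

d.set_with_self()
after_self = d.flag              # now the instance attribute's value

print(after_bare, after_self, Demo.flag)
```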

Your code in handle_data is reading from a variable named start, for similar reasons. Since no variable of that name exists in the function or in any enclosing scope, Python reports that the name is not defined. In Python there is no way to read an attribute without specifying in which object to look for it. A bare start always refers to a variable, either in the local function scope or some outer scope. You need self.start to read the start attribute of the self object.
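The read side can be sketched the same way (Python 3, with a hypothetical class Reader, and assuming no module-level variable called start exists):

```python
class Reader:
    start = False

    def read_bare(self):
        return start             # looks for a variable, never for an attribute

    def read_attr(self):
        return self.start        # looks for the attribute via the instance

r = Reader()

error = None
try:
    r.read_bare()
except NameError as exc:         # raised because no variable named start exists
    error = exc

print(type(error).__name__)
print(r.read_attr())
```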

Remember, the def block defining a method is nothing special, it's a function like any other. It's only later, when that function happens to be stored in an attribute of a class object, that the function can be classified as a method. So the self parameter behaves the same as any other parameter, and indeed any other name. It doesn't have to be named self (though that's a wise convention to follow), and it has no special privileges making reads and writes of bare names look for attributes of self.
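To make that concrete, here is a small sketch (Python 3; describe, Box and this are hypothetical names) in which an ordinary function is defined outside any class, uses a first parameter that is not called self, and only becomes a method once it is stored on the class:

```python
def describe(this):              # first parameter deliberately not named self
    return "value is %d" % this.value

class Box:
    def __init__(self, value):
        self.value = value

# Storing the plain function as a class attribute is what makes it a method.
Box.describe = describe

box = Box(3)
print(box.describe())            # instance passed implicitly
print(Box.describe(box))         # the same call with the instance passed explicitly
```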

So:

  1. Don't define your attributes with their initial values in the class block; that's for values which are shared by all instances of the class, not attributes of each instance. Instance attributes can only be initialised once you have a reference to the particular instance; most commonly this is done in the __init__ method, which is called as soon as the object exists.

  2. You must specify in which object you want to read or write attributes. This applies always, in every context. In particular, you will usually refer to attributes inside methods as self.attribute.

Applying that (and eliminating the semicolons, which you don't need in Python):

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # keep the base class's own initialization
        self.start = False
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'tbody':  # HTMLParser passes tag names in lower case
            self.start = True
            self.counter += 1

    def handle_data(self, data):
        if self.start:
            print(data)

[1] The methods handle_starttag and handle_data are also nothing more than functions which happen to be attributes of an object that is used as a class.

[2] Usually a local variable; if you've declared start to be global or nonlocal then it might be an outer variable. But it's definitely not an attribute on some object you happen to have nearby, even if that other object is bound to the name self.

Ben
  • i normally use __init__ to declare the variables, but i cannot override the HTMLParser's init, since it does some of its own initialization. – Bartlomiej Lewandowski Oct 31 '12 at 07:45
  • Well in that case the proper initialization of your objects is to initialize the attributes you want them to have *plus* do the initialization `HTMLParser` requires. You will need to write an `__init__` method that does this. It is very common to do this when subclassing. You can call `HTMLParser.__init__(self, ...)` from your `__init__` so you don't have to reimplement it (or use `super`, but I don't really recommend that until you've got a good grasp of the basics of Python). – Ben Oct 31 '12 at 08:12
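A sketch of what the comment above describes, assuming Python 3 (where the import is html.parser rather than Python 2's HTMLParser module). The class name TBodyParser and the chunks list are hypothetical; the list is only there so the result can be inspected instead of printed:

```python
from html.parser import HTMLParser

class TBodyParser(HTMLParser):          # hypothetical name
    def __init__(self):
        HTMLParser.__init__(self)       # let the base class set up its own state first
        self.start = False
        self.counter = 0
        self.chunks = []                # collected text, for illustration only

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":              # tag names arrive converted to lower case
            self.start = True
            self.counter += 1

    def handle_data(self, data):
        if self.start:
            self.chunks.append(data)

parser = TBodyParser()
parser.feed("<p>skipped</p><table><TBODY><tr><td>kept</td></tr></TBODY></table>")
parser.close()

print(parser.counter, "".join(parser.chunks))
```

Calling the base __init__ before setting your own attributes keeps the parser's internal state intact, so feed works as usual.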

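Use self to refer to instance attributes (self is a conventional parameter name, not a keyword), and call the base class's __init__ so HTMLParser can still set itself up: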

class MyHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)  # without this the parser's internal state is never set up
        self.start = False
        self.counter = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'tbody':  # tag names are passed in lower case
            self.start = True
            self.counter += 1

    def handle_data(self, data):
        if self.start:
            print(data)

avasal
Preom

As a note, you don't need to put a semicolon ; at the end of each line. You can use it as a separator to put multiple statements on the same line if necessary. See Why is semicolon allowed in this python snippet?
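For instance, a short sketch with hypothetical variable names:

```python
x = 1; y = 2     # two statements on one line, separated by a semicolon
z = x + y;       # a trailing semicolon is legal but has no effect

print(x, y, z)
```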

Community
Tim