4

I made a program that extracts the text from a HTML file. It recurses down the HTML document and returns the list of tags. For eg,

input < li >no way < b > you < /b > are doing this < /li >

output ['no','way','you','are'...].

Here is a highly simplified pseudocode for this:

def get_leaves(node):
    kids=getchildren(node)
    for i in kids:
        if leafnode(i):
            get_leaves(i)
        else:
            a=process_leaf(i)
            list_of_leaves.append(a)

def calling_fn():
    list_of_leaves=[] #which is now in global scope
    get_leaves(rootnode)
    print list_of_leaves    

I am now using list_of_leaves in a global scope from the calling function. The calling_fn() declares this variable, get_leaves() appends to this.

My question is, how do I modify my function so that I am able to do something like list_of_leaves=get_leaves(rootnode), ie without using a global variable?

I dont want each instance of the function to duplicate the list, as the list can get quite big.

Please dont critisize the design of this particular pseudocode, as I simplified this. It is meant for another purpose: extracting tokens along with associated tags using BeautifulSoup

Jesvin Jose
  • 22,498
  • 32
  • 109
  • 202
  • 2
    Have you considered a HTML parser like [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/) – Jacob Jul 12 '11 at 08:11
  • 1
    At least from your pseudocode it doesn't seem list_of_leaves is global. – sateesh Jul 12 '11 at 08:15
  • Please don't consider this a criticism -- it's not, but check out the output of this python expression (after importing re, of course): `re.split(r'<.*?>', "< li >no way < b > you < /b > are doing this < /li >")` – Ray Toal Jul 12 '11 at 08:15
  • 2
    I AM using BeautifulSoup. And using regexes for HTML parsing turned out to be a very touchy topic, my account got temporarily suspended due to negative votes – Jesvin Jose Jul 12 '11 at 09:00

5 Answers5

10

You can pass the result list as optional argument.

def get_leaves(node, list_of_leaves=None):
    list_of_leaves = [] if list_of_leaves is None else list_of_leaves
    kids=getchildren(node)
    for i in kids:
        if leafnode(i):
            get_leaves(i, list_of_leaves)
        else:
            a=process_leaf(i)
            list_of_leaves.append(a)

def calling_fn():
    result = [] 
    get_leaves(rootnode, list_of_leaves=result)
    print result

Python objects are always passed by reference. This has been discussed before here. Some of the built-in types are immutable (e.g. int, string), so you cannot modify them in place (a new string is created when you concatenate two strings and assign them to a variable). Instance of mutable types (e.g. list) can be modified in place. We are taking advantage of this by passing the original list for accumulating result in our recursive calls.

For extracting text from HTML in a real application, using a mature library like BeautifulSoup or lxml.html is always a much better option (as others have suggested).

Community
  • 1
  • 1
Imran
  • 87,203
  • 23
  • 98
  • 131
  • 2
    I think you would be better of using : if list_of_leaves is None: list_of_leaves = []. For example the code: x = [] ;a = x or 'test' (assigns test to a). So that would mean the result list (in the calling function) would be still empty and what you are populating is a separate list in the method 'get_leaves' – sateesh Jul 12 '11 at 08:34
  • Does this pass list_of_leaves by reference? Could you explain it? – Jesvin Jose Jul 12 '11 at 09:44
  • @aitchnyu: explanation added. – Imran Jul 12 '11 at 16:58
8

No need to pass an accumulator to the function or accessing it through a global name if you turn get_leaves() into a generator:

def get_leaves(node):
    for child in getchildren(node):
        if leafnode(child):
            for each in get_leaves(child):
                yield each
        else:
            yield process_leaf(child)

def calling_fn():
    list_of_leaves = list(get_leaves(rootnode))
    print list_of_leaves
pillmuncher
  • 10,094
  • 2
  • 35
  • 33
2

Use a decent HTML parser like BeautifulSoup instead of trying smarter than existing software.

2

@pillmincher's generator answer is the best, but as another alternative, you can turn your function into a class:

class TagFinder:
    def __init__(self):
        self.leaves = []

    def get_leaves(self, node):
        kids = getchildren(node)
        for i in kids:
            if leafnode(i):
                self.get_leaves(i)
            else:
                a = process_leaf(i)
                self.list_of_leaves.append(a)
def calling_fn():
    finder = TagFinder()
    finder.get_leaves(rootnode)
    print finder.list_of_leaves

Your code likely involves a number of helper functions anyway, like leafnode, so a class also helps group them all together into one unit.

Ned Batchelder
  • 364,293
  • 75
  • 561
  • 662
1

As a general question about recursion, this is a good one. It is common to have a recursive function that accumulates data into some collection. Either the collection needs to be a global variable (bad) or it is passed to the recursive function. When collections are passed in almost every language, only a reference is passed so you do not have to worry about space. Someone just posted an answer showing how to do this.

Ray Toal
  • 86,166
  • 18
  • 182
  • 232