This is a Beautiful Soup procedure that grabs the content within all <p> HTML tags. After grabbing content from some web pages, I get an error that says maximum recursion depth exceeded.

def printText(tags):
    for tag in tags:
        if tag.__class__ == NavigableString:
            print tag,
        else:
            printText(tag)
    print ""
#loop over urls, send soup to printText procedure

The bottom of the traceback:

  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 16, in printText
    printText(tag)
  File "web_content.py", line 13, in printText
    if tag.__class__ == NavigableString:
RuntimeError: maximum recursion depth exceeded in cmp
yayu

3 Answers

Your printText() calls itself recursively if it encounters anything other than a NavigableString. This includes subclasses of NavigableString, such as Comment. Calling printText() on a Comment iterates over the text of the comment, and causes the infinite recursion you see.

I recommend using isinstance() in your if statement instead of comparing class objects:

if isinstance(tag, basestring):

I diagnosed this problem by inserting a print statement before the recursion:

print "recursing on", tag, type(tag)
printText(tag)
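
To make the difference between the two checks concrete, here is a minimal sketch in Python 3. The classes below are stand-ins made up to mirror the inheritance described above (Comment deriving from NavigableString, which derives from the built-in string type); they are not imported from Beautiful Soup. Python 3 has no basestring, so str plays that role here.

```python
class NavigableString(str):
    """Stand-in for Beautiful Soup's NavigableString (assumed to subclass str)."""


class Comment(NavigableString):
    """Stand-in for Beautiful Soup's Comment, a NavigableString subclass."""


node = Comment("this is a comment")

# Exact class comparison: a Comment is not *exactly* a NavigableString,
# so the original check sends it to the recursive branch.
exact_match = node.__class__ == NavigableString

# isinstance() also accepts subclasses, so the Comment counts as a string.
subclass_match = isinstance(node, NavigableString)
string_match = isinstance(node, str)  # basestring in Python 2

print(exact_match, subclass_match, string_match)
```

With the exact class comparison, a Comment falls through to the else branch and gets recursed into; with isinstance() it is handled like any other string.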
Leonard Richardson

You probably hit a string. Iterating over a string yields 1-length strings. Iterating over that 1-length string yields a 1-length string. Iterating over THAT 1-length string...
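
A small sketch of that chain, using plain Python 3 strings and no Beautiful Soup. The function names first_chars and walk are made up for illustration; walk mimics the shape of printText without any string check.

```python
def first_chars(text, steps):
    """Repeatedly take the first item produced by iterating the previous item."""
    seen = []
    current = text
    for _ in range(steps):
        current = next(iter(current))
        seen.append(current)
    return seen


def walk(node):
    """Recurse into every child, with no special case for strings."""
    for child in node:
        walk(child)


# Iterating "abc" gives "a"; iterating "a" gives "a" again, and so on.
chain = first_chars("abc", 4)

try:
    walk("a")
    hit_limit = False
except RuntimeError:  # Python 3 raises RecursionError, a RuntimeError subclass
    hit_limit = True

print(chain, hit_limit)
```

The recursion never reaches a node that has no children, so the only thing that stops it is the interpreter's recursion limit.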

Ignacio Vazquez-Abrams
  • Could you explain? This is sample output from a previous URL before it crashes: "And what used to be a two-month process is for many companies now a five-day process. The problem with raising your 1 to 2 million on convertible..." This contains strings, as well as 1-length strings. – yayu Apr 12 '12 at 06:08
  • Which part don't you understand, the iterating or the iterating? Of course, this all depends on you understanding how the code works. – Ignacio Vazquez-Abrams Apr 12 '12 at 06:09
  • Can you clarify what it means that "you hit a string"? What do you mean "hit"? Isn't the entire HTML document that is parsed by Beautiful Soup into a DOM initially a string - and aren't the tags themselves strings of characters? We have run into the same error simply trying to substitute values into a simple HTML page with 12 anchors but it is unclear what is triggering recursion. – Praxiteles Jan 07 '16 at 10:55
  • @Praxiteles: A DOM document is made up of nodes. Some of the nodes are tags, and some are text. If you attempt to recurse on text, i.e. a string, you will recurse forever since iterating over a non-empty string yields at least 1 string. – Ignacio Vazquez-Abrams Jan 07 '16 at 11:02

I had the same problem. If you have nested tags with a depth of about 480 levels and you convert the outer tag to a string/unicode, you get RuntimeError: maximum recursion depth exceeded. Every level needs two nested method calls, so you soon hit Python's default limit of 1000 nested calls. You can raise this limit with sys.setrecursionlimit(), or you can use this helper. It extracts all text from the HTML and returns it wrapped in a <pre> element:

def beautiful_soup_tag_to_unicode(tag):
    try:
        return unicode(tag)
    except RuntimeError as e:
        if not str(e).startswith('maximum recursion'):
            raise
        # With more than about 480 levels of nested tags, unicode(tag) can hit
        # the recursion limit, so fall back to collecting the text nodes only.
        out = []
        for mystring in tag.findAll(text=True):
            mystring = mystring.strip()
            if not mystring:
                continue  # skip whitespace-only text nodes
            out.append(mystring)
        return u'<pre>%s</pre>' % u'\n'.join(out)
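
For the other option, raising the limit, here is a hedged sketch in Python 3 built on sys.getrecursionlimit() and sys.setrecursionlimit() from the standard library. The context manager recursion_limit and the toy function depth are hypothetical names for illustration, and the value 5000 is arbitrary. Setting the limit very high can crash the interpreter with a real stack overflow instead of raising an exception, so this restores the old value afterwards.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily change the recursion limit, then restore the old value."""
    old_limit = sys.getrecursionlimit()
    sys.setrecursionlimit(limit)
    try:
        yield
    finally:
        sys.setrecursionlimit(old_limit)


def depth(n):
    """Toy recursive function that needs roughly n nested calls."""
    return 0 if n == 0 else 1 + depth(n - 1)


before = sys.getrecursionlimit()
with recursion_limit(5000):
    reached = depth(3000)  # deeper than the usual default limit of 1000
after = sys.getrecursionlimit()

print(reached, before == after)
```

Wrapping only the deep conversion in the with block keeps the higher limit from applying to the rest of the program.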
guettli