
I'm trying to extract all links from a given text via recursion. The problem I have is that I want to store links in a list and for whatever reason calling append crashes my code.

def findLink(text, start, *links):
    linkStart = text.find('http', start);
    if linkStart == -1:
        return

    linkEnd = text.find('">', linkStart);
    url = text[linkStart:linkEnd];
    links.append(url);
    findLink(text, linkEnd + 2, links);


source = '''<html xmlns="http://www.w3.org/1999/xhtml">
          <head>
          <title>Udacity</title>
          </head>
          <body>
          <h1>Udacity</h1>
          <p><b>Udacity</b> is a private institution of
          <a href="http://www.wikipedia.org/wiki/Higher_education">higher education founded by</a> <a href="http://www.wikipedia.org/wiki/Sebastian_Thrun">Sebastian Thrun</a>, David Stavens, and Mike Sokolsky with the goal to provide university-level education that is "both high quality and low cost".</p>   
          <p> It is the outgrowth of a free computer science class offered in 2011 through Stanford University. Currently, Udacity is working on its second course on building a search engine. Udacity was announced at the 2012 <a href="http://www.wikipedia.org/wiki/Digital_Life_Design">Digital Life Design</a> conference.</p>      
          </body>
          </html>'''

links = list();
findLink(source, 0, links);

for link in links:
    print(link);
haosmark

1 Answer


First, two general comments:

  1. You don't need semicolons at the end of the lines.

  2. Don't parse HTML with hand-rolled string searches or regular expressions. Python has a convenient HTML parser in the standard library (html.parser).
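To illustrate the second point, here is a minimal sketch of link extraction with the standard library's html.parser module. The names LinkCollector and extract_links are made up for this example, and it only looks at href attributes of a tags, which is an assumption about what counts as a link.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def extract_links(html):
    """Return the href targets of all <a> tags in html, in document order."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

Unlike searching for 'http', this skips URLs that are not links at all, such as the xmlns attribute on the html tag in the question's sample.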

Now, concerning your question. When you write a function with varargs at the end, like f(a, b, *c), Python makes c a tuple. Tuples are immutable, so they have no append() method. You can either convert it to a list and then use append(), or go (semi)pure and write links = links + (url,).
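A small sketch of that behaviour, using made-up function names (show_varargs, add_url) purely for illustration:

```python
def show_varargs(first, *rest):
    # Whatever the caller passes, `rest` arrives as a tuple.
    return type(rest).__name__, rest


def add_url(url, *links):
    # Tuples cannot be appended to, so build a new tuple instead.
    links = links + (url,)
    return links


# Passing a list as a single extra argument wraps it in a one-element tuple.
print(show_varargs(0, []))                  # ('tuple', ([],))
print(add_url('http://a', 'http://b'))      # ('http://b', 'http://a')
```

Note that in the question's first call, findLink(source, 0, links), the list itself becomes the only element of the tuple, so links.append fails with an AttributeError on the tuple.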

Also, the way you call the recursive function later is incorrect. You need to write

findLink(text, linkEnd + 2, *links)

for links to be passed as varargs (this works for both a list and a tuple). Having said that, there is really no reason to pass it like that: every collected link becomes a separate positional argument, so on a large chunk of HTML the call ends up with a lot of arguments, and because the function receives a fresh tuple each time, the caller's list is never updated. Just pass it normally as a list or a tuple.
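Passing the list as an ordinary parameter could look like the sketch below. It keeps the question's string-search approach for the sake of the recursion exercise; the snake_case names and the extra guard for a missing '">' are additions made here, not part of the original code.

```python
def find_links(text, start, links):
    """Recursively append every 'http...' run ending at '">' to links."""
    link_start = text.find('http', start)
    if link_start == -1:
        return links  # no more links: end the recursion

    link_end = text.find('">', link_start)
    if link_end == -1:
        return links  # unterminated URL: stop rather than slice wrongly

    links.append(text[link_start:link_end])
    # Same list object is passed down, so the caller sees every append.
    return find_links(text, link_end + 2, links)


sample = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'
found = find_links(sample, 0, [])
for link in found:
    print(link)
```

Each link costs one stack frame, so a page with more links than Python's recursion limit allows would raise RecursionError; a loop avoids that if it ever matters.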

fjarri
  • Thanks. I'm very new to this language and following Udacity's class to see if I can pick up some interesting tricks from it. I'm not really sure what you mean by "a lot of arguments". It's the same three arguments, one of which is a reference to an object. You are correct though, it's certainly possible to cause a stack overflow if html size is large enough. I'm just practicing recursion, hence this approach. – haosmark Nov 27 '14 at 04:40