I am trying to scrape a website for text. Each page contains a link to the next page: the first page links to "/chapter1/page2.html", which in turn links to "/chapter1/page3.html", and so on. The last page has no link. I am trying to write a program that opens a URL, prints the text of the page, searches the text for the URL of the next page, and repeats until it reaches the last page, which has no URL. I tried using if, else, and return, but I do not understand where to put them.

def scrapy(url):
    result = requests.get(url, timeout=30.0)
    result.encoding = 'cp1251'
    page = result.text
    link = re.findall(r"\bh(.+?)html", page) # finds link to next page between tags
    print("Continue parsing next page!")
    url = link
    print(url)
    return(url)

url = "http://mywebpage.com/chapter1/page1.html"
result = requests.get(url, timeout=30.0)
result.encoding = 'cp1251'
page = result.text
link = re.findall(r"\bh(.+?)html", page)
if link == -1:
   print("No url!")
else:
   scrapy(url)

Unfortunately it doesn't work; it makes only one loop. Could you kindly tell me what I am doing wrong?

user136555
  • [This answer](https://stackoverflow.com/questions/50118298/recursive-promises-not-returning/50121218#50121218) is in JavaScript, but I think it's probably what you're looking for – Mulan Jan 11 '19 at 04:23
  • Thank you, but it doesn't help! I am very inexperienced in programming. – user136555 Jan 11 '19 at 04:59

1 Answer

A couple of things. First, to be recursive, scrapy needs to call itself. Second, recursive functions need branching logic for a base case and a recursive case. In other words, part of your function needs to look like this (pseudocode):

if allDone
    return
else
    recursiveFunction(argument)

For scrapy, you'd want this branching logic below the line where you find the link (the one where you call re.findall). If you don't find a link, then scrapy is done. If you find a link, then you call scrapy again, passing your newly found link. Your scrapy function will probably need a few more small fixes (for example, re.findall returns a list, which is empty when nothing matches, so comparing it to -1 never succeeds), but hopefully that will get you unstuck with recursion.
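To make the pseudocode concrete, here is a minimal sketch of how scrapy could look in Python. Several things in it are my assumptions rather than anything from your page: the NEXT_LINK pattern (it simply takes the first href ending in .html, so you will likely need to adapt it to your markup), the fetch helper, the get_page parameter (there only so the function can be tried without a network connection), and the choice to return the list of visited URLs.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: grabs the first href that ends in ".html".
# Adjust it to match the real "next page" link on your site.
NEXT_LINK = re.compile(r'href="([^"]+\.html)"')


def fetch(url):
    """Download a page and decode it as cp1251, as in the question."""
    # Imported here so the rest of the sketch can be tried offline.
    import requests
    result = requests.get(url, timeout=30.0)
    result.encoding = 'cp1251'
    return result.text


def scrapy(url, get_page=fetch):
    """Print the page at url, then recurse into the next page if there is one.

    Returns the list of URLs visited, in order.
    """
    page = get_page(url)
    print(page)

    match = NEXT_LINK.search(page)
    if match is None:
        # Base case: no link on this page, so we are done.
        print("No url!")
        return [url]

    # Recursive case: resolve the relative link against the current
    # URL and let scrapy call itself with the new address.
    next_url = urljoin(url, match.group(1))
    print("Continue parsing next page!")
    return [url] + scrapy(next_url, get_page)
```

You would start it with a single call such as scrapy("http://mywebpage.com/chapter1/page1.html"). Two caveats: Python limits recursion depth (1000 frames by default), so for a very long chain of pages a while loop is the safer choice, and if a page ever links back to an earlier one, this sketch will keep going until it hits that limit.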

If you want to get really good at thinking in terms of recursion, this book is a good one: https://www.amazon.com/Little-Schemer-Daniel-P-Friedman/dp/0262560992

Jordan Wilcken