I am trying to scrape a website for text. Each page contains a link to the next page: the first page links to "/chapter1/page2.html", which in turn links to "/chapter1/page3.html", and so on. The last page has no link. I am trying to write a program that fetches a URL, prints the text of the page, searches the text for the URL of the next page, and repeats until it reaches the last page, which has no link. I have tried using if, else and return, but I do not understand where they need to go.
import re
import requests

def scrapy(url):
    result = requests.get(url, timeout=30.0)
    result.encoding = 'cp1251'
    page = result.text
    link = re.findall(r"\bh(.+?)html", page)  # finds link to next page between tags
    print("Continue parsing next page!")
    url = link
    print(url)
    return url

url = "http://mywebpage.com/chapter1/page1.html"
result = requests.get(url, timeout=30.0)
result.encoding = 'cp1251'
page = result.text
link = re.findall(r"\bh(.+?)html", page)
if link == -1:
    print("No url!")
else:
    scrapy(url)
Unfortunately it doesn't work; it only goes through one iteration and then stops. Could you kindly tell me what I am doing wrong?
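For reference, here is a rough sketch of how the loop could be structured; it is one possible approach, not a definitive answer. Two things stand out in the code above: `re.findall` returns a list (empty when nothing matches) and never `-1`, so the `else` branch always runs, and `scrapy` is called exactly once with nothing feeding its return value back into another request. The sketch below uses a `while` loop that stops when no next link is found. The names `fetch`, `find_next_url`, `scrape_all`, the `fetch_page` and `max_pages` parameters, and the `href="...html"` pattern are all assumptions made for illustration; the real pattern depends on the site's markup.

```python
import re
from urllib.parse import urljoin

# Assumption: the next-page link is the first double-quoted href ending in ".html".
NEXT_LINK = re.compile(r'href="([^"]+?\.html)"')


def fetch(url):
    """Download one page and decode it as cp1251, as in the question."""
    # Imported here so the rest of the sketch can be tried without the package.
    import requests
    result = requests.get(url, timeout=30.0)
    result.encoding = 'cp1251'
    return result.text


def find_next_url(page, current_url):
    """Return the absolute URL of the next page, or None if there is no link."""
    match = NEXT_LINK.search(page)
    if match is None:
        return None
    # Links such as "/chapter1/page2.html" are relative, so resolve them
    # against the URL of the page they were found on.
    return urljoin(current_url, match.group(1))


def scrape_all(start_url, fetch_page=fetch, max_pages=1000):
    """Follow next-page links from start_url and return the URLs visited."""
    visited = []
    url = start_url
    # Stop when there is no next link, when a link points back to a page
    # already seen, or when the (hypothetical) safety limit is reached.
    while url is not None and url not in visited and len(visited) < max_pages:
        page = fetch_page(url)
        print(page)
        visited.append(url)
        url = find_next_url(page, url)
    return visited
```

Calling `scrape_all("http://mywebpage.com/chapter1/page1.html")` would then print each page in turn and stop at the first page without a matching link. Passing a different `fetch_page` function makes it possible to try the loop on canned HTML without touching the network.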