-1

I'm working on a Flask app where I'm web scraping from multiple URLs (a wide range, sometimes over 100). It works locally, but when deployed to Heroku, it times out.

This is a snippet of the code I'm using, where bookOutletHas() is the function that does the web scraping using requests and BeautifulSoup.

for book in gr_books:
    temp_book = book.book
    title = temp_book["title"]
    author = temp_book["authors"]["author"]["name"]
    arr = bookOutletHas(title=title, author=author)
    if arr[0]:
        valid_books.append(str(arr[1]))
return render_template("main_page.html", books=valid_books)

My first instinct was to find a way to update the page every time the valid_books array grows (re-rendering the template each time?), but I'm unsure how to approach this. I don't have any knowledge of JavaScript, so if possible I'm seeking an approach using only Python and HTML.

Luiza
  • The request in which the scraping is happening times out? – robinsax May 16 '20 at 22:24
  • @robinsax It times out when fulfilling the POST request. Based on the logs, I see that it successfully scrapes for the first few elements of the array. So I don't think it's the scraping function that is taking too long, but that it has a lot of URLs to hit? – Luiza May 16 '20 at 22:26

1 Answer

0

You can configure your environment to allow request handling to take longer. See how to set http request timeout in python flask for a rundown on that.
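For example, if the app is served by gunicorn (the usual setup on Heroku), the worker timeout can be raised in the Procfile. This is a sketch assuming a gunicorn deployment with the app object in app.py; note that Heroku's own router still cuts off any request after 30 seconds regardless of the worker timeout, which is another reason to prefer the background approach below.

```
# Procfile -- raise gunicorn's worker timeout from the default 30s
web: gunicorn app:app --timeout 120
```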

However, this is a lot more work than you should be doing in a request thread. A better approach is to do the scraping periodically in the background and cache the results, then read from the cache when handling the request.
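A minimal sketch of that background-scrape-and-cache pattern. The gr_books data and bookOutletHas() here are stand-ins for the ones in the question; swap in your real versions, and in the actual view return render_template() instead of the raw list.

```python
import threading
import time

# Stand-in data; in the real app this comes from the Goodreads response.
gr_books = [("Dune", "Frank Herbert"), ("Emma", "Jane Austen")]

def bookOutletHas(title, author):
    # Placeholder for the real requests + BeautifulSoup scraper.
    return (True, f"{title} by {author}")

# Cache shared between the background thread and request handlers.
cache = {"valid_books": []}
cache_lock = threading.Lock()

def scrape_all():
    """One full scrape pass over every book; returns the matches."""
    results = []
    for title, author in gr_books:
        arr = bookOutletHas(title=title, author=author)
        if arr[0]:
            results.append(str(arr[1]))
    return results

def refresh_loop(interval=15 * 60):
    """Background loop: rescrape, then swap the results in atomically."""
    while True:
        results = scrape_all()
        with cache_lock:
            cache["valid_books"] = results
        time.sleep(interval)

# At app startup, start the scraper once in a daemon thread:
# threading.Thread(target=refresh_loop, daemon=True).start()

def main_page():
    # The real Flask view would do:
    #   return render_template("main_page.html", books=books)
    with cache_lock:
        books = list(cache["valid_books"])
    return books
```

Requests now only read the cache, so they return immediately no matter how many URLs the scraper has to hit.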

robinsax