-1

I'm working on a Flask app where I'm web scraping from multiple URLs (a wide range, sometimes over 100). It works locally, but when deployed to Heroku, it times out.

This is a snippet of the code I'm using, where bookOutletHas() is the function that does the web scraping using requests and BeautifulSoup.

for book in gr_books:
    temp_book = book.book
    title = temp_book["title"]
    author = temp_book["authors"]["author"]["name"]
    arr = bookOutletHas(title=title, author=author)
    if arr[0]:
        valid_books.append(str(arr[1]))
return render_template("main_page.html", books=valid_books)

My first instinct was to find a way to update the page every time the valid_books array grows (re-rendering the template each time?), but I'm unsure how to approach this. I don't have any knowledge of JavaScript, so if possible I'm seeking an approach using only Python and HTML.

Luiza
  • The request in which the scraping is happening times out? – robinsax May 16 '20 at 22:24
  • @robinsax It times out when fulfilling the POST request. Based on the logs, I see that it successfully scrapes for the first few elements of the array. So I don't think it's the scraping function that is taking too long, but that it has a lot of URLs to hit? – Luiza May 16 '20 at 22:26

1 Answer

0

You can configure your environment to allow request handling to take longer. See how to set http request timeout in python flask for a rundown on that.
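For example, if the app is served by gunicorn (the usual setup on Heroku), the worker timeout can be raised in the Procfile. This is a sketch assuming a gunicorn deployment with the app object in app.py; note that Heroku's own router still cuts off any request after 30 seconds regardless of the worker timeout, which is another reason to prefer the background approach below.

```
# Procfile -- raise gunicorn's worker timeout from the default 30s
web: gunicorn app:app --timeout 120
```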

However, this is a lot more work than you should be doing in a request thread. A better approach is to do the scraping periodically in the background and cache the results, then read from the cache when handling the request.
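A minimal sketch of that background-scrape-and-cache pattern. The gr_books data and bookOutletHas() here are stand-ins for the ones in the question; swap in your real versions, and in the actual view return render_template() instead of the raw list.

```python
import threading
import time

# Stand-in data; in the real app this comes from the Goodreads response.
gr_books = [("Dune", "Frank Herbert"), ("Emma", "Jane Austen")]

def bookOutletHas(title, author):
    # Placeholder for the real requests + BeautifulSoup scraper.
    return (True, f"{title} by {author}")

# Cache shared between the background thread and request handlers.
cache = {"valid_books": []}
cache_lock = threading.Lock()

def scrape_all():
    """One full scrape pass over every book; returns the matches."""
    results = []
    for title, author in gr_books:
        arr = bookOutletHas(title=title, author=author)
        if arr[0]:
            results.append(str(arr[1]))
    return results

def refresh_loop(interval=15 * 60):
    """Background loop: rescrape, then swap the results in atomically."""
    while True:
        results = scrape_all()
        with cache_lock:
            cache["valid_books"] = results
        time.sleep(interval)

# At app startup, start the scraper once in a daemon thread:
# threading.Thread(target=refresh_loop, daemon=True).start()

def main_page():
    # The real Flask view would do:
    #   return render_template("main_page.html", books=books)
    with cache_lock:
        books = list(cache["valid_books"])
    return books
```

Requests now only read the cache, so they return immediately no matter how many URLs the scraper has to hit.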

robinsax