I have a page that I need to get the source of to use with BS4, but the middle of the page takes about a second to load its content, and requests.get captures the page source before that section loads. How can I wait before getting the data?

r = requests.get(URL + self.search, headers=USER_AGENT, timeout=5)
soup = BeautifulSoup(r.content, 'html.parser')
a = soup.find_all('section', 'wrapper')

The page

<section class="wrapper" id="resultado_busca">
ribas

6 Answers

It doesn't look like a problem of waiting; it looks like the element is being created by JavaScript, and requests can't handle elements that are generated dynamically by JavaScript. A suggestion is to use selenium together with PhantomJS to get the page source, then you can use BeautifulSoup for your parsing; the code shown below will do exactly that:

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://legendas.tv/busca/walking%20dead%20s03e02"
browser = webdriver.PhantomJS()
browser.get(url)
html = browser.page_source
soup = BeautifulSoup(html, 'lxml')
a = soup.find('section', 'wrapper')

Also, there's no need to use .find_all if you are only looking for one element.

Vinícius Figueiredo

I had the same problem, and none of the submitted answers really worked for me. But after long research, I found a solution:

from requests_html import HTMLSession
s = HTMLSession()
response = s.get(url)
response.html.render()

print(response.html.html)
# prints out the content of the fully loaded page
# response.html can be parsed with, for example, bs4

The requests_html package (docs) was written by the author of the requests library. It has some additional JavaScript capabilities, like the ability to wait until the JS of a page has finished loading.
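For example, here is a minimal sketch of chaining render() into BeautifulSoup, using the page from the question (the sleep value is an illustrative choice, not part of the original answer):

from bs4 import BeautifulSoup
from requests_html import HTMLSession

url = "http://legendas.tv/busca/walking%20dead%20s03e02"  # the page from the question
s = HTMLSession()
response = s.get(url)
response.html.render(sleep=1)  # pause 1 second after the JS has rendered

# hand the rendered HTML to BeautifulSoup for parsing
soup = BeautifulSoup(response.html.html, 'html.parser')
print(soup.find('section', 'wrapper'))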

The package only supports Python 3.6 and above at the moment, so it might not work with older versions.

pppery
Enoch
  • How can we add a wait time to it? Is there any way? – Ibtsam Ch Sep 12 '21 at 09:43
  • @IbtsamCh Yes! There are two ways: use the `wait` argument in render to add a wait time in seconds **before the JavaScript is rendered**, and use the `sleep` argument to add a wait in seconds **after the JS has rendered**. Both arguments only accept integer values. Example: `response.html.render(wait=2, sleep=3)` _waits 2 secs before and 3 secs after the JavaScript has rendered._ – Enoch Oct 26 '21 at 20:40
  • I get this message: There is no current event loop in thread 'Dummy-1' – Marlowe Feb 22 '22 at 09:17
  • @Marlowe Me too; there's no current event loop in thread. – greendino Apr 24 '22 at 09:40
  • I had to do `print(response.text)` for it to actually print anything – mjr May 16 '23 at 21:13
  • After adding wait and sleep, it still did not get the dynamically rendered content (content rendered from an API). – Sagar Davara Jul 24 '23 at 07:00

Selenium is a good way to solve this, but the accepted answer is quite outdated. As @Seth mentioned in the comments, the headless mode of Firefox/Chrome (or possibly other browsers) should be used instead of PhantomJS.

First of all, you need to download the specific driver:
  • Geckodriver for Firefox
  • ChromeDriver for Chrome

Next, you can add the path to the downloaded driver to your system PATH variable. That's not necessary, though; you can also specify in code where the executable lies.

Firefox:

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
# executable_path param is not needed if you updated PATH
browser = webdriver.Firefox(options=options, executable_path='YOUR_PATH/geckodriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()

Similarly for Chrome:

from bs4 import BeautifulSoup
from selenium import webdriver    

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# executable_path param is not needed if you updated PATH
browser = webdriver.Chrome(options=options, executable_path='YOUR_PATH/chromedriver.exe')
browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
html = browser.page_source
soup = BeautifulSoup(html, features="html.parser")
print(soup)
browser.quit()

It's good to remember about browser.quit() to avoid hanging processes after code execution. If you worry that your code may fail before the browser is disposed, you can wrap it in a try...except block and put browser.quit() in the finally part to ensure it will be called.
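Alternatively, WebDriver objects support the context-manager protocol, which calls quit() automatically on exit; a minimal sketch of that pattern, assuming a Selenium version recent enough to support it (this is not from the original answer):

from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')

# the with block calls browser.quit() on exit, even if an exception is raised
with webdriver.Firefox(options=options) as browser:
    browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
    soup = BeautifulSoup(browser.page_source, features="html.parser")
print(soup)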

Additionally, if part of source is still not loaded using that method, you can ask selenium to wait till specific element is present:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as ec
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
browser = webdriver.Firefox(options=options, executable_path='YOUR_PATH/geckodriver.exe')

try:
    browser.get("http://legendas.tv/busca/walking%20dead%20s03e02")
    timeout_in_seconds = 10
    WebDriverWait(browser, timeout_in_seconds).until(ec.presence_of_element_located((By.ID, 'resultado_busca')))
    html = browser.page_source
    soup = BeautifulSoup(html, features="html.parser")
    print(soup)
except TimeoutException:
    print("I give up...")
finally:
    browser.quit()

If you're interested in drivers other than Firefox or Chrome, check the docs.
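One caveat: Selenium 4 deprecates the executable_path parameter shown above in favor of a Service object. A minimal sketch of the updated call, assuming Selenium 4 is installed:

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
# the path argument is not needed if the driver is on your PATH
service = Service(executable_path='YOUR_PATH/geckodriver.exe')
browser = webdriver.Firefox(service=service, options=options)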

Zuku

I found a way to do that!

r = requests.get('https://github.com', timeout=(3.05, 27))

Here, timeout takes two values: the first one sets the connection timeout, and the second one is the read timeout, which decides how many seconds to wait for the server to send a response. You can calculate the time it takes for the page to populate, set the timeout accordingly, and then print the data out.
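For completeness, a minimal sketch of passing the two timeouts and handling the case where the read timeout expires (the values are illustrative):

import requests

try:
    # timeout=(connect timeout, read timeout) in seconds
    r = requests.get('https://github.com', timeout=(3.05, 27))
    print(r.text)
except requests.exceptions.Timeout:
    print("The server did not send a response within the read timeout")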

shekhar chander
  • Setting timeout=None worked for me. https://requests.readthedocs.io/en/master/user/advanced/#timeouts – Andrea Massetti Oct 12 '20 at 15:56
  • This solved the issue for me! That was the second parameter of timeout. – Hamza Abbad Mar 12 '23 at 04:49
  • This is totally incorrect for this question. As described [here](https://requests.readthedocs.io/en/latest/user/advanced/#timeouts), the second variable is the "read" timeout, that is, "the number of seconds the client will wait for the server to send a response". In the OP case the response is send in the allotted time, but it contains Javascript that must be lazy loaded. And in any case, if the second variable is not specified, it equals the first by default (the "connect" timeout, or "the number of seconds Requests will wait for your client to establish a connection to a remote machine") – robertspierre Jul 01 '23 at 10:27
  • Setting timeout as you mentioned did not work for me – Sagar Davara Jul 24 '23 at 06:59

In Python 3, using the urllib module in practice works better when loading dynamic webpages than the requests module.

For example:

import urllib.request
import urllib.error

url = "http://legendas.tv/busca/walking%20dead%20s03e02"
try:
    with urllib.request.urlopen(url) as response:
        html = response.read().decode('utf-8')  # use whatever encoding the webpage declares
except urllib.error.HTTPError as e:
    if e.code == 404:
        print(f"{url} is not found")
    elif e.code == 503:
        print(f"{url} base webservices are not available")
        # can add authentication here
    else:
        print('http error', e)
Ingy Swan

Just to list my way of doing it, in case it can be of value for someone:

import time

import requests

max_retries = 5   # some int
retry_delay = 2   # seconds between attempts
n = 1
ready = False
while n <= max_retries:
    try:
        response = requests.get('https://github.com')
        if response.ok:
            ready = True
            break
    except requests.exceptions.RequestException:
        print("Website not available...")
    n += 1
    time.sleep(retry_delay)

if not ready:
    print("Problem")
else:
    print("All good")
Sonia